Monday, September 24, 2012

utm, no escape and irregular expressions


If you're doing work online and hang around marketing people enough, at some point "utm" might enter your world. Your landing pages will get hit with URL parameters like

utm_source=google&utm_medium=cpc&utm_term=some%25252Bgreat%25252Bkeywords&utm_campaign=Program

Beautiful, huh? ("UTM" stands for "Urchin Tracking Module"; Urchin was an early Google purchase and is at the core of their marketing tools.)

We were trying to use the utm_term variable, which carries the search keywords that made our ad appear in a web search (the ad the user then clicked). We wanted to parse those keywords and plug them into our own local search engine.

I was aware of the %2B encoding... that's the hex code for a plus, and a plus is used to represent a space character, since URLs aren't supposed to have spaces. The double encoding (space to plus to hex) seemed like a bit of overkill, but whatever...

But it wasn't enough! Searches were failing and it wasn't clear why... using our log browsing tool "Splunk" (what a name!) I grabbed log data to get the actual URLs, and found that our primitive "replace %2B with space" routine wasn't cutting it, because of beauties like "%252B" and even "%25252b" showing up. %25 is the code for the percent sign itself. So these guys weren't just doing double encoding, but triple and quadruple encoding! Yeesh. (Meme: "Yo Dawg, we heard you like encoding, so we encoded your encoding so you can escape while you escape!")
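To see how those layers pile up, here's a minimal sketch that re-encodes a single plus sign a few times with the built-in encodeURIComponent(). The helper name encodeTimes is made up for illustration, and this is only my guess at how the values got that way; I don't know what the ad platform actually does internally.

```javascript
// Hypothetical helper: apply encodeURIComponent() to a string N times.
function encodeTimes(text, times) {
  let result = text;
  for (let i = 0; i < times; i++) {
    // Each pass turns every "%" from the previous pass into "%25".
    result = encodeURIComponent(result);
  }
  return result;
}

console.log(encodeTimes("+", 1)); // %2B
console.log(encodeTimes("+", 2)); // %252B
console.log(encodeTimes("+", 3)); // %25252B
```

Each extra pass just sticks another "25" after the percent sign, which is exactly the pattern showing up in the logs.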

I'm not sure if there's a handy library that would more properly do the un-escaping, but a bit of playing with JavaScript and actual data leads me to believe this regex should do the trick:

function removeDoubleEncodedSpaces(val) {
    // a percent sign, any number of "25"s, then "2b" (case insensitive)
    return val.replace(/%(25)*2b/gi, " ");
}

So the pattern is a literal percent sign, 0 or more "25"s, then a literal "2b" at the end, and I wanted to replace all of those with a space, and do it in a case insensitive way. (Sometimes being an old Perl geek has its advantages.)
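As a quick sanity check, here's the same function run against one sample for each depth of encoding. The sample strings are modeled on the utm_term value near the top of this post, not copied from real logs.

```javascript
function removeDoubleEncodedSpaces(val) {
  return val.replace(/%(25)*2b/gi, " ");
}

// Illustrative inputs: the same keywords at three depths of encoding.
const samples = [
  "some%2Bgreat%2Bkeywords",
  "some%252Bgreat%252Bkeywords",
  "some%25252bgreat%25252Bkeywords", // mixed case on purpose
];

for (const sample of samples) {
  console.log(removeDoubleEncodedSpaces(sample)); // some great keywords
}
```

One caveat: this treats every encoded plus as a space, so a keyword with a real plus sign in it (say "c++") would lose it. For our search box that seemed like an acceptable trade.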

UPDATE: a coworker pointed out I could use decodeURIComponent(), a JavaScript built-in. But I'd have to apply it 3 or 4 times in this case... I'm not sure if there's a while loop that would make sure all the encoding was taken care of.
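For what it's worth, a loop like that might look something like the sketch below: keep calling decodeURIComponent() until the string stops changing. The name decodeFully and the cap of 10 passes are my own assumptions, so treat this as a rough idea rather than a tested production fix. Two details to watch: decodeURIComponent() throws a URIError on a stray percent sign, and it leaves "+" alone, so the plus-to-space step still has to happen separately.

```javascript
// Hypothetical helper: decode repeatedly until nothing changes.
function decodeFully(val, maxPasses) {
  let current = val;
  const limit = maxPasses || 10; // assumed safety cap, picked arbitrarily
  for (let i = 0; i < limit; i++) {
    let decoded;
    try {
      decoded = decodeURIComponent(current);
    } catch (e) {
      // Malformed escape (e.g. a lone "%"): keep what we have so far.
      break;
    }
    if (decoded === current) {
      break; // no escapes left to decode
    }
    current = decoded;
  }
  // decodeURIComponent() does not touch "+", so convert pluses last.
  return current.replace(/\+/g, " ");
}

console.log(decodeFully("some%25252Bgreat%25252Bkeywords")); // some great keywords
```

The downside of decoding "all the way" is that a keyword which legitimately contains something that looks like an escape (a literal "%41", say) would get decoded too, so the narrower regex above may still be the safer choice for this one job.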
