Monday, September 24, 2012
utm, no escape and irregular expressions
If you're doing work online and hang around marketing people enough, at some point "utm" might enter your world. Your landing pages will get hit with URL parameters like
Beautiful, huh? ("UTM" stands for "Urchin Tracking Module", as Urchin was an early Google purchase and is at the core of their marketing tools.)
We were trying to use the utm_term variable to grab the keywords the user typed into a web search, the ones that triggered our ad (which the user then clicked). We wanted to parse those keywords and plug them into our own local search engine.
I was aware of the %2B encoding... that's the hex code for a plus, and a plus is used to represent a space character, since URLs aren't supposed to have spaces. The double encoding (space to plus to hex) seemed like a bit of overkill, but whatever...
But it wasn't enough! Searches were failing and it wasn't clear why... using our log browsing tool "Splunk" (what a name!) I grabbed log data to get the actual URLs, and found that our primitive "replace %2B with space" routine wasn't cutting it, because of beauties like "%252B" and even "%25252b" showing up. %25 is the code for the percent sign itself. So these guys weren't just doing double encoding, but triple and quadruple encoding! Yeesh. (Meme: "Yo Dawg, we heard you like encoding, so we encoded your encoding so you can escape while you escape!")
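To see where strings like "%252B" come from, here is a minimal sketch that percent-encodes a value several times over. The helper name encodeTimes and the sample keyword "red+shoes" are made up for illustration; encodeURIComponent is the standard JavaScript function, and whatever system produced the URLs in our logs may well have encoded them some other way.

```javascript
// Hypothetical helper: apply percent-encoding `times` times in a row.
// Each pass turns every "%" from the previous pass into "%25".
function encodeTimes(value, times) {
  let result = value;
  for (let i = 0; i < times; i++) {
    result = encodeURIComponent(result);
  }
  return result;
}

// "red+shoes" is an invented keyword; the "+" stands in for a space.
console.log(encodeTimes("red+shoes", 1)); // red%2Bshoes
console.log(encodeTimes("red+shoes", 2)); // red%252Bshoes
console.log(encodeTimes("red+shoes", 3)); // red%25252Bshoes
```

Every extra layer wedges another "25" between the percent sign and the "2B", which is exactly the shape of the junk in the logs.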
return val.replace(/\%(25)*2b/gi," ");
So the pattern was a literal percent sign, 0 or more "25"s, then ending with a literal "2b", and I wanted to replace every match with a space, and do it in a case insensitive way. (Sometimes being an old Perl geek has its advantages.)
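For anyone who wants to try it, here is that one-liner wrapped in a function with a few sample inputs. The function name plusCodesToSpaces and the sample strings are invented for this sketch; only the regular expression itself comes from the code above (minus the backslash before the percent sign, which JavaScript doesn't need).

```javascript
// Hypothetical wrapper around the replace call from the post.
function plusCodesToSpaces(val) {
  // "%", then zero or more "25" pairs, then "2b".
  // g = replace every match, i = ignore case (so "2B" and "2b" both hit).
  return val.replace(/%(25)*2b/gi, " ");
}

// Sample inputs are made up; they cover one, two and three layers of encoding.
console.log(plusCodesToSpaces("red%2Bshoes"));              // red shoes
console.log(plusCodesToSpaces("red%252Bshoes"));            // red shoes
console.log(plusCodesToSpaces("red%25252bshoes%2bonline")); // red shoes online
```

Note that this only cleans up encoded plus signs. Other escapes, such as %20 or a doubly encoded %2520, pass through untouched.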