Wednesday, January 11, 2023

regex-like stuff manually is kind of tough!

The other month I wrote about tuning the (admittedly bare-bones) UI of the homebrew CMS for my website. The previous behavior let me change highlighted text into a link (heuristically figuring out whether the highlighted text was a URL or should be used as the clickable text for a link). The upgrade was to linkify ALL URL-looking strings that weren't already part of a clickable link.

But the regex I thought of for that (URLs start with http:// or https:// , end with white space or the end of string, but don't match if the character before is a single or double quote) was beyond me.

A friend on a private Slack (after apologetically reminding me of the famous "You can't parse [X]HTML with regex" Stack Overflow answer) gave me a version that worked using a "negative lookbehind". But, lo and behold, THE DANG THING DOESN'T WORK ON SAFARI. (I realized that while trying to figure out why it wasn't working on my iPhone, where it promised to be most useful...)
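For the curious, a lookbehind-based pattern for the rule above might look something like the sketch below. This is a guess reconstructed from the description, not the actual regex from that Slack thread, and the names `bareUrlPattern` and `linkifyWithLookbehind` are made up for illustration. It depends on the `(?<!...)` syntax, so it will only work on engines that support lookbehind.

```javascript
// Hypothetical sketch, not the original regex:
// match http:// or https:// followed by non-whitespace characters,
// but only when the character just before is NOT a single or double quote.
const bareUrlPattern = /(?<!['"])https?:\/\/\S+/g;

function linkifyWithLookbehind(text) {
  // Wrap each bare URL in an anchor tag with empty link text.
  return text.replace(bareUrlPattern, (url) => `<a href="${url}"></a>`);
}

console.log(linkifyWithLookbehind('go to https://foo.com now'));
// go to <a href="https://foo.com"></a> now
```

A URL that already sits inside `href="..."` has a quote right before it, so the lookbehind rejects it and the existing link is left alone.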

When I pointed that out to my friend, he grumbled that this is why he steered his career the hell away from frontend dev, which is a fair point. It's the ubiquity of JavaScript in browsers that powers much of its popularity, and we're past the age when we needed jQuery to iron out the differences, but still, you're kind of at the mercy of the browser for this kind of thing.

Anyway, I decided to try and do the matching and rewriting "by hand" - it turned out to be kind of a fun "CompSci class" problem - aided a bit with the new ECMAScript array functions, which actually guided me to a more modular design, rather than trying to do it all in one big wacky loop.

Here's a CodePen for it - the steps were:

1. Find the indices of everything starting with https:// 

2. Filter out those matches when the character before the match is a single or double quote

3. Map those matches from simple offsets to [offset, length of string until whitespace or end of string]

4. Reverse the list of those [offset, length] pairs (since we're going to be operating on a string in place, we start at the end so that earlier offsets are still valid) and for each, wrap the URL with <a href=" before and "></a> after.
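To see why the reversing in step 4 matters, here is a tiny standalone sketch. The helper name `wrapSpans` and the sample string are made up for illustration; the real version is `frameMatches` in the full code. Each insertion makes the string longer, so anything to the right of it shifts, while offsets to the left stay put. Working from the last span back to the first means the offsets we haven't used yet are still correct.

```javascript
// Hypothetical helper: wrap each [offset, length] span of `text`
// with `before` and `after`, processing the spans from last to first.
function wrapSpans(text, spans, before, after) {
  let result = text;
  // Copy before reversing so the caller's array is not mutated.
  [...spans].reverse().forEach(([offset, length]) => {
    const end = offset + length;
    result =
      result.slice(0, offset) + before +
      result.slice(offset, end) + after +
      result.slice(end);
  });
  return result;
}

const sample = 'see https://a.io and https://b.io';
// Offsets computed against the ORIGINAL string.
const spans = [[4, 12], [21, 12]];
console.log(wrapSpans(sample, spans, '<a href="', '"></a>'));
// see <a href="https://a.io"></a> and <a href="https://b.io"></a>
```

Going front to back instead would require adding the length of everything inserted so far to each later offset, which is doable but easier to get wrong.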

It was nice that I was still able to use my earlier "testLinkify()" function to make sure I was hitting some edge cases properly (I think there might still be some weird edge cases lurking but nothing I'd hit in normal use)


// Runs a handful of edge cases through the linkifier.
function testLinkify(){
  console.log('starting test---------------------------');
  test(`https://foo.com`, `<a href="https://foo.com"></a>`);
  test(`\nhttps://foo.com`,`\n<a href="https://foo.com"></a>`);
  test(`<a href="https://foo.com"></a>`,`<a href="https://foo.com"></a>`);
  test(`<a href="https://foo.com"></a> https://bar.com`,`<a href="https://foo.com"></a> <a href="https://bar.com"></a>`);
  test(`https://foo.com\nhttps://bar.com <a href="https://baz.com">BAZ</a>`,`<a href="https://foo.com"></a>\n<a href="https://bar.com"></a> <a href="https://baz.com">BAZ</a>`);
}

// Compares actual output to the expected output and logs PASS or FAIL.
function test(input,expect){
  const testFunction = linkifyBareHttps;
  const output = testFunction(input);
  // console.log (output === expect ? 'PASS':'FAIL');
  if(output !== expect) {
    console.log(`FAIL INPUT: ${input} EXPECT: ${expect} OUTPUT: ${output}`);
  } else {
    console.log('PASS') ;
  }
}

// Wraps every https:// URL that isn't right after a quote in <a href="..."></a>.
function linkifyBareHttps(inputText){
  //console.log(frameMatches('foo', 'fooAB fooB "foo bar fooC\nfooD',['"'], '<a href="', '"></a>'));
  return frameMatches('https://', inputText,["'",'"'],'<a href="', '"></a>');
  //.replace(replacePattern1, '<a href="$2"></a>');
}

// Step 1: offsets of every occurrence of needle in haystack.
function getIndicesOf(needle, haystack) {
  let needleLen = needle.length;
  let startIndex = 0;
  let index;
  const indices = [];
  while ((index = haystack.indexOf(needle, startIndex)) > -1) {
    indices.push(index);
    startIndex = index + needleLen;
  }
  return indices;
}

// True if the character just before offset is one of the prefixes.
function isPrefixedBy(offset,haystack,prefixes){
  if(offset === 0) return false;
  const match = prefixes.includes(haystack.charAt(offset-1));
  return match;
}

// Step 2: drop the matches that come right after a blocking prefix.
function getIndicesOfNotFollowing(needle, haystack, prefixes){
  return getIndicesOf(needle,haystack).filter( (x)=> ! isPrefixedBy(x,haystack,prefixes));
}

function isWhiteSpace(c){
  return c === ' ' || c === '\t' || c === '\n' || c === '\r';
}

// Counts characters from start up to the next whitespace (or maxOffset).
function getLengthTilWhiteSpace(haystack,start,maxOffset){
  let ptr = start;
  while(ptr < maxOffset && !isWhiteSpace(haystack.charAt(ptr))) ptr++;
  return ptr-start;
}

// Step 3: map each offset to [offset, length], stopping at the next match at the latest.
function getPrefixFilterOffsetsAndLengths(needle, haystack, prefixes){
  const indices = getIndicesOfNotFollowing(needle, haystack, prefixes);
  return indices.map((idx,i)=>[idx, getLengthTilWhiteSpace(haystack,indices[i],i < indices.length -1 ? indices[i+1] : haystack.length)]);
}

// Step 4: work from the end of the string backwards so earlier offsets stay valid.
function frameMatches(needle,haystack,blockingPrefixes,before,after){
  const placements = getPrefixFilterOffsetsAndLengths(needle, haystack, blockingPrefixes).reverse();
  let newString = haystack;
  placements.forEach(([offset,length]) => {
    newString = newString.substring(0,offset) + before + newString.substring(offset,offset+length) + after + newString.substring(offset+length);
  });
  return newString;
}

//console.log(frameMatchesNotPrefixedBy('foo', 'foo foo afoo bar foo','(',')',['a']));
//console.log(frameMatches('foo', 'fooAB fooB "foo bar fooC\nfooD',['"'], '<a href="', '"></a>'));
//console.log(frameMatches('foo', 'XXX fooAB YYY',['"'], '<a href="', '"></a>'));
testLinkify();
