Tuesday, November 13, 2018

grep / search for multiple terms across files in PHP

"Hey, what's the matter?" 
"I'm sad because you're going to die." 
"Yeah, that bugs me sometimes too. But not so much as you think... ...When you get as old as I am, you start to realize that you've told most of the good stuff you know to other people anyway." 
--Richard Feynman and Danny Hillis. 

After blogging for almost two decades, my site kirk.is is an increasingly important supplement to my memory - it's great to have an archive for the full text of half-remembered quotes and excerpts.

A long while back I wrote a simple "grep" in Perl to find stuff - it could only match one exact string across all the files, but often that was enough, and unlike Google, results came back in chronological order, which was often useful. It has only become more useful (albeit mostly just to me) over the years, so I decided to make it a little smarter and able to search for multiple terms and sentence snippets.

The logic turned out to be slightly trickier than I bargained for. What I finally realized I wanted was this: for each file, go through each query term, and for each term go through the lines in the file. If any query term matches none of the lines, bail on this file; otherwise remember which lines matched. If we get through all the terms, then every term has at least one match, so take all the lines that matched any term, sort 'em by line number (having already made sure they were dedup'd), escape the HTML, bold the matching terms, and return the lines as an array.

Here's the code for that:
function getLinesMatchingQueryStringInFile($file,$query) {
    $linesInFile = file($file);
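    #break up query, using CSV (split on spaces after collapsing whitespace)
    $query = preg_replace('/\s+/',' ',$query);
    $queryterms = str_getcsv(trim($query), ' ');
    #drop empty terms so a blank query doesn't match every line
    $queryterms = array_filter($queryterms, function ($t) { return $t !== null && $t !== ''; });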

    $matchingLineNumbers = array();
    
    foreach ($queryterms as $queryterm) {
        $lineNumbersMatchingThisTerm = array();
        foreach ($linesInFile as $linenum => $line) {            
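            #preg_quote() so characters like . ? or / in the term are matched literally
            if (preg_match('/' . preg_quote($queryterm, '/') . '/i', $line)) {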
                $lineNumbersMatchingThisTerm[] = $linenum;
            }
        }
        #if any query term doesn't match we bail!
        if (count($lineNumbersMatchingThisTerm) == 0) { 
            return array();    
        } else {
             foreach($lineNumbersMatchingThisTerm as $linenum) {
                if (! in_array($linenum,$matchingLineNumbers)) {   
                    $matchingLineNumbers[] = $linenum;
                }
             }
        }
    }
    sort($matchingLineNumbers);
    $res = array();
    foreach ($matchingLineNumbers as $linenum) {
        $matchline = htmlspecialchars($linesInFile[$linenum]);
        foreach ($queryterms as $queryterm) {
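            #escape the term the same way as the line, then quote it for the regex
            $pattern = '/(' . preg_quote(htmlspecialchars($queryterm), '/') . ')/i';
            $matchline = preg_replace($pattern, '<b>$1</b>', $matchline);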
        }
        $res[] = $matchline;
    }
    return $res;
}
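One caveat with bolding the terms one at a time: a short term such as b can match inside the <b> tags that were inserted for an earlier term. Here's a rough sketch of a single-pass alternative, not what the site actually runs. highlightTerms is a made-up name, and it assumes the terms are meant as literal text rather than regular expressions.

```php
<?php
// Hypothetical helper: bold every term in a single pass, so a later term
// can never match inside the <b> tags added for an earlier one.
function highlightTerms(string $line, array $terms): string {
    $escapedLine = htmlspecialchars($line);
    $alternatives = array();
    foreach ($terms as $term) {
        if ($term === null || $term === '') {
            continue; // skip empty terms, they would match everywhere
        }
        // escape the term the same way as the line, then quote it for the regex
        $alternatives[] = preg_quote(htmlspecialchars($term), '/');
    }
    if (count($alternatives) == 0) {
        return $escapedLine;
    }
    // one alternation pattern covering all terms, case-insensitive
    $pattern = '/(' . implode('|', $alternatives) . ')/i';
    return preg_replace($pattern, '<b>$1</b>', $escapedLine);
}

echo highlightTerms('a bold move & more', array('bold', 'b')), "\n";
// prints: a <b>bold</b> move &amp; more
```

Because the regex engine scans the escaped line once, the inserted tags are never themselves searched.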
A slightly clever bit (i.e. found slightly deeper on Stack Overflow) was trimming the query expression and then using str_getcsv() with a space as the delimiter, which does a good job of breaking something like 
find "something good"
into 
find
and 
something good
which is what I'd want to search on.
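
To see that splitting step on its own, here's a minimal sketch. splitQueryIntoTerms is a hypothetical name (the function above does this inline), and dropping empty terms is an assumption about what a blank query should do.

```php
<?php
// Hypothetical helper: the trim / collapse-whitespace / str_getcsv steps in isolation.
function splitQueryIntoTerms(string $query): array {
    // collapse runs of whitespace into single spaces
    $query = preg_replace('/\s+/', ' ', trim($query));
    // space as the delimiter, double quote as the enclosure for phrases
    $terms = str_getcsv($query, ' ', '"', '\\');
    // drop null/empty entries (assumption: a blank query means "no terms")
    return array_values(array_filter($terms, function ($t) {
        return $t !== null && $t !== '';
    }));
}

foreach (splitQueryIntoTerms('find "something good"') as $term) {
    echo "[$term]\n";
}
```

A quoted phrase stays together as one term, while unquoted words are split on spaces.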
