Wednesday, June 15, 2016

php recursive directory iterator and funky character cleanup

Lately I've been switching to PHP from Perl, not just for server processing for basic webpages but for command line maintenance tasks as well. The syntax is so much less wonky, the core documentation so solid, and the baked-in utility functions so well thought out, that I haven't been missing Perl much at all. (On the other hand, I seem to be memorizing very few PHP function names, but the process of getting stuff done is still so fast, even with all the googling involved.)

(And like I said last year, PHP is much better than JavaScript at dealing with global variables in functions.)

I've been running the user-submitted poetry site loveblender.com for decades, literally (if not so literarily). It's had a few overhauls, but what it's got is still that bad old Perl, and I guess most of it isn't super UTF-8/16 aware. Usually that's not a problem, but sometimes (I think when someone copies and pastes from Word or whatnot) funky characters show up: smart quotes, ellipses, etc. I wanted to do a search and replace across my file store. (Guilty confession: for medium-size sites that don't need a lot of cross-content grepping, I tend to turn first to flat files in folders, often many small ones, so something like atomicity is preserved even if things go wrong. It has some drawbacks, but I never have to worry about database conversions, and I obviously have a number of tools for hacking the content directly.)

Anyway, here is the hack I came up with... and it had to do a lot of work: the Blender has had over 27,000 submissions over the years.

The core idea was to come up with a set of "replacers" and then do a simple search and replace with fixFile(). To get there, though, I needed a few iterations of diagnostic work to figure out what the replacements should be: see lookForFunkyCharactersInFile(), mostly for old-school "8-bit ASCII" characters 128-255, and lookForFunkyEntityStringsInFile() for a longer bit of escaping that has been showing up lately. (For the former, http://www.ascii-code.com/ was a huge help in figuring out what the bad characters were meant to be. I made a quick macro to import a big swath of the foreign characters in a single gulp.)
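As a rough illustration of that diagnostic step, here is a minimal sketch that tallies which high bytes (128-255) show up in a chunk of text, so the most common offenders can be looked up first. The helper name tallyHighBytes() and the sample string are made up for this example and are not part of the script below.

```php
<?php
// Hypothetical helper (not in the original script): count every byte with a
// value of 128 or more, most frequent first.
function tallyHighBytes($text) {
    $counts = array();
    // count_chars() in mode 1 returns byte value => frequency,
    // listing only the bytes that actually occur in $text
    foreach (count_chars($text, 1) as $byte => $count) {
        if ($byte >= 128) {
            $counts[$byte] = $count;
        }
    }
    arsort($counts); // sort by frequency, descending, keeping the byte keys
    return $counts;
}

// Made-up sample: a Windows-1252 "smart" apostrophe (byte 146) used twice,
// plus one e-acute (byte 233)
$sample = "don" . chr(146) . "t stop, caf" . chr(233) . " can" . chr(146) . "t wait";
foreach (tallyHighBytes($sample) as $byte => $count) {
    print "byte $byte seen $count time(s)\n";
}
```

Running something like this over a handful of files gives a short list of byte values to check against a character table before adding them to the replacers.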



<?php

$directory = '/home/kirkjerk/domains/loveblender.com/blend/works';



$breaker = 0; # set to a byte value (e.g. 146) to report only that "funky character", or 0 to report all of them

$replacers = array(
    "&#226;&#128;&#153;" => "'",
    "&#226;&#128;&#156;" => "\"",
    "&#226;&#128;&#157;" => "\"",
    "&#226;&#128;&#166;" => "...",
    "&#195;&#162;&#226;&#130;&#172;&#194;&#166;" => "...",
    "&#195;&#162;&#226;&#130;&#172;&#226;&#132;&#162;" => "'",
    "&#195;&#162;&#226;&#130;&#172;&#197;&#147;" => "\"",
    "&#195;&#162;&#226;&#130;&#172;&#194;&#157;" => "\"",
    chr(128) => "",
    chr(131) => "&fnof;",
    chr(133) => "...",
    chr(144) => "",
    chr(145) => "'",
    chr(146) => "'",
    chr(147) => "\"",
    chr(148) => "\"",
    chr(150) => "-",
    chr(151) => "--",
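    chr(152) => "", #probably the last byte of a UTF-8 left single quote (E2 80 98)
    chr(153) => "", #probably the last byte of a UTF-8 right single quote (E2 80 99)
    chr(156) => "", #probably the last byte of a UTF-8 left double quote (E2 80 9C)
    chr(157) => "", #probably the last byte of a UTF-8 right double quote (E2 80 9D)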
    chr(160) => " ",
    chr(161) => "&iexcl;",
    chr(163) => "&pound;",
    chr(165) => "&yen;",
    chr(166) => "", #probably the last byte of a UTF-8 ellipsis (E2 80 A6)
    chr(167) => "&sect;",
    chr(169) => "(c)",
    chr(170) => "",
    chr(173) => "",
    chr(174) => "(r)",
    chr(180) => "'",
    chr(183) => "*",
    chr(189) => "1/2",
    chr(226) => "'", #this one shows up even tho it's also &acirc;
    chr(191) => "&iquest;",chr(192) => "&Agrave;",chr(193) => "&Aacute;",chr(194) => "&Acirc;",chr(195) => "&Atilde;",
    chr(196) => "&Auml;",chr(197) => "&Aring;",chr(198) => "&AElig;",chr(199) => "&Ccedil;",chr(200) => "&Egrave;",
    chr(201) => "&Eacute;",chr(202) => "&Ecirc;",chr(203) => "&Euml;",chr(204) => "&Igrave;",chr(205) => "&Iacute;",
    chr(206) => "&Icirc;",chr(207) => "&Iuml;",chr(208) => "&ETH;",chr(209) => "&Ntilde;",chr(210) => "&Ograve;",
    chr(211) => "&Oacute;",chr(212) => "&Ocirc;",chr(213) => "&Otilde;",chr(214) => "&Ouml;",chr(215) => "&times;",
    chr(216) => "&Oslash;",chr(217) => "&Ugrave;",chr(218) => "&Uacute;",chr(219) => "&Ucirc;",chr(220) => "&Uuml;",
    chr(221) => "&Yacute;",chr(222) => "&THORN;",chr(223) => "&szlig;",chr(224) => "&agrave;",chr(225) => "&aacute;",
    chr(227) => "&atilde;",chr(228) => "&auml;",chr(229) => "&aring;",chr(230) => "&aelig;",
    chr(231) => "&ccedil;",chr(232) => "&egrave;",chr(233) => "&eacute;",chr(234) => "&ecirc;",chr(235) => "&euml;",
    chr(236) => "&igrave;",chr(237) => "&iacute;",chr(238) => "&icirc;",chr(239) => "&iuml;",chr(240) => "&eth;",
    chr(241) => "&ntilde;",chr(242) => "&ograve;",chr(243) => "&oacute;",chr(244) => "&ocirc;",chr(245) => "&otilde;",
    chr(246) => "&ouml;",chr(247) => "&divide;",chr(248) => "&oslash;",chr(249) => "&ugrave;",chr(250) => "&uacute;",
    chr(251) => "&ucirc;",chr(252) => "&uuml;",chr(253) => "&yacute;",chr(254) => "&thorn;",chr(255) => "&yuml;"

);


$it = new RecursiveIteratorIterator(new RecursiveDirectoryIterator($directory));
$it->rewind();

while($it->valid()) {
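    # getSubPathName() is relative to $directory, so the file reads and writes
    # below assume the script is run from inside $directory
    $pathname = $it->getSubPathName();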
    if (endsWith($pathname, ".work")) {
        #lookForFunkyCharactersInFile($pathname);
        #lookForFunkyEntityStringsInFile($pathname);
        fixFile($pathname);
    }
    $it->next();
}

function fixFile($pathname){
    $guts = file_get_contents($pathname);
    $fixed = replaceBads($guts);
    # only rewrite files that actually changed
    if ($fixed !== $guts) {
        file_put_contents($pathname,$fixed);
    }
}

function lookForFunkyEntityStringsInFile($pathname){
    $lines = file($pathname);

    foreach($lines as $line){
        $line = replaceBads($line);
        # strpos() returns 0 for a match at the very start of the line, so compare against false
        if(strpos($line,"&#226;") !== false){
            $url = pathnameToLink($pathname);
            print $pathname."  ".$url."\n".$line."\n";
        }
    }
    
}

function lookForFunkyCharactersInFile($pathname){
    global $breaker;
    $lines = file($pathname);
    foreach($lines as $line){
        $line = replaceBads($line);
        $lineIfFunky = isFunky($line,$breaker);
        if($lineIfFunky != "") {
            $url = pathnameToLink($pathname);
            print $pathname."  ".$url."\n".$line.$lineIfFunky."\n";
        }
    }
}


function isFunky($line,$ugh){
    for($i = 0; $i < strlen($line); $i++){ 
        $c = substr($line,$i,1);
        if(
          ($ugh == 0 && ord($c) >= 128) || 
          ($ugh != 0 && ord($c) == $ugh)
          ) {
            return str_repeat(" ",$i)."^         ------ at pos $i got ".ord($c);
        }
    }
    return "";
}
function pathnameToLink($path){
    return "http://www.loveblender.com/blend/wv.cgi?id=".str_replace(".work","",str_replace("/",".",$path));
}
function replaceBads($line){
    global $replacers;
    foreach ($replacers as $badFrom => $badTo){
        $pos = strpos($line,$badFrom);
        if ($pos !== false) {
            # print "  FROM ".$line;
            $line = str_replace($badFrom,$badTo,$line);
            # print "   TO  ".$line;
        }
    }
    return $line;
}
function endsWith($haystack, $needle) {
    return $needle === "" || (($temp = strlen($haystack) - strlen($needle)) >= 0 && strpos($haystack, $needle, $temp) !== false);
}

?>
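One possible variation, sketched here rather than taken from the script above: replaceBads() walks the table one pair at a time, so each later pair sees the output of the earlier ones. PHP's built-in strtr(), when handed an array, does the whole job in a single pass, trying the longest keys first and never rescanning text it has already replaced. The function name replaceBadsOnce(), the cut-down table, and the sample string are assumptions for illustration only.

```php
<?php
// A small subset of the replacers table, just for illustration
$replacers = array(
    "&#226;&#128;&#153;" => "'",
    chr(133) => "...",
    chr(146) => "'",
    chr(147) => "\"",
    chr(148) => "\"",
    chr(233) => "&eacute;",
);

// Hypothetical alternative to replaceBads(): one pass over the text,
// longest keys matched first, replaced text is not scanned again
function replaceBadsOnce($text, $pairs) {
    return strtr($text, $pairs);
}

// Made-up sample mixing smart quotes, an accented letter, an entity run and an ellipsis byte
$sample = chr(147) . "caf" . chr(233) . chr(148) . " isn&#226;&#128;&#153;t open" . chr(133);
print replaceBadsOnce($sample, $replacers) . "\n";
// prints: "caf&eacute;" isn't open...
```

With the table as it stands the two approaches should give the same result for this sample; the single pass mostly matters if a replacement ever produces text that another key would match.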


