Friday, May 10, 2013

webscraping in PHP 101

Recently I had to make a "webscraper" web service to present a simple interface to a rather complex login and settings-change flow on a 3rd party website that was not set up as a web service.

A few things learned: one is  is a pretty cool tool for faking web requests. It had some options to view each step of a login process that had a lot of redirects, and we could even see where automated JavaScript got in the mix.

The key turned out to be grepping out the jsessionid and using that in the later requests. The other key was not trying to skip steps in the redirection, even if they seemed like they shouldn't affect the login attempt. (Also, being a bit formal about it and clearly printing out the response at each step was helpful.)
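For what it's worth, here's a rough sketch of that grepping step. It assumes the jsessionid shows up as a Set-Cookie header and that each redirect arrives as a Location header; the helper names and the sample response are made up for illustration, and a site that tucks the jsessionid into the URL instead would need a different pattern.

```php
<?php
// Hypothetical helper: pull the JSESSIONID value out of a raw header block.
// Assumes a line like "Set-Cookie: JSESSIONID=abc123; Path=/".
function extractJsessionId(string $header): ?string {
    if (preg_match('/^Set-Cookie:\s*JSESSIONID=([^;\s]+)/mi', $header, $m)) {
        return $m[1];
    }
    return null; // no session cookie in this response
}

// Hypothetical helper: find where this step wants to redirect us,
// so each hop can be requested (and printed) explicitly.
function extractLocation(string $header): ?string {
    if (preg_match('/^Location:\s*(\S+)/mi', $header, $m)) {
        return $m[1];
    }
    return null; // not a redirect
}

// Made-up response headers standing in for one step of the login.
$sample = "HTTP/1.1 302 Found\r\n"
        . "Set-Cookie: JSESSIONID=ABC123DEF; Path=/; HttpOnly\r\n"
        . "Location: https://example.com/login/step2\r\n\r\n";

print extractJsessionId($sample) . "\n";
print extractLocation($sample) . "\n";
```

The idea is to feed the session id back as a cookie on the next request and then go to whatever Location says, one hop at a time, rather than jumping ahead.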

We selected PHP for this. Initially I tried to steer us to Perl, where I wouldn't have to look up how to do every simple task, but the LWP library was segfaulting. (Maybe it was confused by having to make the https requests? Not sure, but it wasn't worth dwelling on.)

I don't like PHP... I first tried to use it in the early 2000s, when it was kind of a beta-feeling project, and it has never felt fully baked to me. Perl has odd syntaxes that reflect its history, but it always feels rock-solid, with damn few cases of "oh, Perl is this way because that was easier for the implementor of Perl to write". PHP still feels like the Preprocessor-for-Perl that it started as. It also has a "Guess What I Mean" philosophy I don't trust... for example, the standard curl functions I used blast the result of the request to stdout by default, unless you set CURLOPT_RETURNTRANSFER. While I admit this may be a fairly common use case, the more Unix-ish way of doing it would be to just return the value, and then the programmer can trivially print out that result if they want.

Anyway, for future reference, here's what a POST ended up looking like in PHP:
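function doPOST($url,$payload){
  print "<h1>POST to ".htmlentities($url)." with ".htmlentities($payload)."</h1>\n";
  //open connection
  $ch = curl_init();
  //set the url, POST mode, POST data
  curl_setopt($ch,CURLOPT_URL, $url);
  curl_setopt($ch,CURLOPT_POST, 1);
  curl_setopt($ch,CURLOPT_POSTFIELDS, $payload);
  //don't let curl follow redirects; each step gets requested by hand
  curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
  //return the response as a string instead of printing it directly
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
  //curl_setopt($ch, CURLOPT_VERBOSE, 1);
  //include the response headers in the returned string
  curl_setopt($ch, CURLOPT_HEADER, 1);

  $response = curl_exec($ch);
  if ($response === false) {
    //request failed outright; report why and bail
    print "<p>curl error: ".htmlentities(curl_error($ch))."</p>";
    return false;
  }
  //get the header as separate from the body
  $header_size = curl_getinfo($ch, CURLINFO_HEADER_SIZE);
  $header = substr($response, 0, $header_size);
  $body = substr($response, $header_size);
  print "<hr>".htmlentities($header)."<hr>".htmlentities($body)."<hr><hr>";
  return $header;
}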
The $payload param should be URL-encoded data of the key1=val1&key2=val2 type. (In both iOS programming and here, I'm surprised that making the coders do the encoding themselves and sending the POST raw like that is the more common option.)
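If you'd rather not hand-roll that string, PHP's built-in http_build_query will do the encoding from an array. A minimal sketch, with made-up field names:

```php
<?php
// Hypothetical form fields, for illustration only.
$fields = [
    'username' => 'alice@example.com',
    'password' => 'p&ss word',
];

// Builds a URL-encoded key1=val1&key2=val2 string; the separator is
// passed explicitly so it doesn't depend on the arg_separator.output ini setting.
$payload = http_build_query($fields, '', '&');

print $payload . "\n";
// prints: username=alice%40example.com&password=p%26ss+word
```

The result can be handed straight to the function above as $payload. Note that passing the array itself to CURLOPT_POSTFIELDS is not equivalent: curl then sends the body as multipart/form-data rather than as a URL-encoded string.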
