I have a PHP script that sends a curl request to market.android.com/mylibrary, retrieves the page, and parses it with regex. When you run the first example linked below, it outputs "Something went wrongSomething went wrong", one message for each of the regex tests at the bottom. If you comment out line 74 and uncomment line 75, it works. To see what curl is returning, add echo($result); at the bottom.

Be sure to fill in your Google credentials at the top and enable curl on your web server: Example file 1

Now in this second example I have taken only the relevant portions from the curl results and manually escaped all the apostrophes. I put the same regex strings at the bottom and it works exactly as expected.

Example file 2

Is anyone able to see what is causing the problem? I have tried using preg_last_error() but it simply returns 0. Thanks!
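
For reference, a preg_last_error() result of 0 is the constant PREG_NO_ERROR: the PCRE engine ran without an internal failure, which is different from the pattern actually matching. The sketch below illustrates that distinction only. The pattern and input are hypothetical, not the ones from the example files, and it is not a diagnosis of the script above.

```php
<?php
// Hypothetical pattern and input, chosen so that the pattern cannot match.
$pattern = '/<img alt="([^"]+)"/';
$html    = "<img alt='Some App' src='a.png'>"; // single-quoted attributes

$count = preg_match($pattern, $html, $m);

var_dump($count);                              // int(0): the pattern ran and found nothing
var_dump($count === false);                    // bool(false): no PCRE failure occurred
var_dump(preg_last_error() === PREG_NO_ERROR); // bool(true)
```

preg_match() returns false only when the engine itself fails (for example, on hitting the backtrack limit); a plain non-match returns 0 and leaves preg_last_error() at PREG_NO_ERROR.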

tgrosinger
  • Why are you not using a DOM parser like DOMDocument? – Lawrence Cherone Feb 17 '12 at 22:37
  • I had never heard of this before. I would still need to use regex for finding what I need though right? – tgrosinger Feb 17 '12 at 22:39
  • No. You NEVER parse HTML/XML with a regex. You'll just end up ripping out your hair trying to figure out why it's not working. – Marc B Feb 17 '12 at 22:42
  • No, it replaces regex. Using regex to parse HTML is bad practice, as eventually your expression will fail due to changes at the source site. – Lawrence Cherone Feb 17 '12 at 22:43
  • Well that is very good to know, do you guys have any particularly good articles on using this method instead? I only see a way to pass in a filename or a url, no method of handing it a string that contains an entire page like what curl gives me. – tgrosinger Feb 17 '12 at 22:45
  • It's not that you couldn't use regex for HTML *extraction* (some newcomers here always confuse that with parsing). But it's only advisable to people proficient with it. Regarding your repurposed topic, there are duplicates en masse. http://stackoverflow.com/questions/3650125/how-to-parse-html-with-php – mario Feb 17 '12 at 22:51

1 Answer

I got this working thanks to the tips provided in the comments. Here is the solution:

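// $result holds the HTML string returned by curl.
$doc = new DOMDocument();
// @ suppresses the warnings loadHTML() emits on malformed markup.
@$doc->loadHTML($result);
$images = $doc->getElementsByTagName('img');

// Maps each image's alt text to its image URL.
$apps = array();

foreach($images as $img) {
    $alt = $img->getAttribute('alt');
    if($alt != '') {
        $src = $img->getAttribute('src');
        // Lazy-loaded images have a base64 placeholder in src;
        // the real URL is in data-lazysrc.
        if(strpos($src, 'data:image/gif;base64') !== false) {
            $src = $img->getAttribute('data-lazysrc');
        }
        $apps[$alt] = $src;
    }
}

return $apps;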
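
For anyone who wants to try this without the Market page, here is a self-contained sketch of the same approach. The function name extractApps and the sample markup are made up for illustration, and libxml_use_internal_errors() is swapped in for the @ operator as one possible way to keep parser warnings quiet; treat it as a sketch under those assumptions rather than the exact code from my script.

```php
<?php
// Hypothetical helper: maps each image's alt text to its image URL.
function extractApps(string $html): array
{
    // Collect libxml parse warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    // Discard the collected warnings and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $apps = [];
    foreach ($doc->getElementsByTagName('img') as $img) {
        $alt = $img->getAttribute('alt');
        if ($alt === '') {
            continue; // skip images without alt text
        }
        $src = $img->getAttribute('src');
        // Lazy-loaded images carry a base64 placeholder in src;
        // the real URL is then in data-lazysrc.
        if (strpos($src, 'data:image/gif;base64') !== false) {
            $src = $img->getAttribute('data-lazysrc');
        }
        $apps[$alt] = $src;
    }
    return $apps;
}

// Made-up sample markup standing in for the curl result.
$sample = '<html><body>'
    . '<img alt="App One" src="http://example.com/one.png">'
    . '<img alt="App Two" src="data:image/gif;base64,R0lGOD"'
    . ' data-lazysrc="http://example.com/two.png">'
    . '<img src="http://example.com/spacer.gif">'
    . '</body></html>';

print_r(extractApps($sample));
```

With this sample the result has two entries, keyed "App One" and "App Two"; the third image is skipped because it has no alt text.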
tgrosinger