
I'm trying to scrape the URL of the poster image, but for some reason it isn't working. The regex works fine on regex101 but not on the actual page.

My code:

<?php

    $url="http://www.imdb.com/title/tt0121955/";

    $ch2 = curl_init();
    curl_setopt ($ch2, CURLOPT_URL, $url);
    curl_setopt ($ch2, CURLOPT_SSL_VERIFYPEER, FALSE);
    curl_setopt ($ch2, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31"); 
    curl_setopt ($ch2, CURLOPT_TIMEOUT, 60);
    curl_setopt ($ch2, CURLOPT_SSL_VERIFYHOST, false); 
    curl_setopt ($ch2, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt ($ch2, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt ($ch2, CURLOPT_REFERER, $url);
    $result = curl_exec ($ch2);
    curl_close($ch2);

    if(preg_match_all('/<td rowspan="2" id="img_primary"><div class="image"><a href="(.*)"><img alt="(.*)" title="South Park \(1997\) Poster" src="(.*)" itemprop="image" height="(.*)" width="(.*)"><\/a><\/div>/', $result, $matches) !== false) {

    foreach($matches as $match) {
        echo $match[0];
        echo $match[1];
        echo $match[2];
        echo $match[3];
    }

    }
?>

Also I did var_dump on $matches and it outputs:

array(6) { [0]=> array(0) { } [1]=> array(0) { } [2]=> array(0) { } [3]=> array(0) { } [4]=> array(0) { } [5]=> array(0) { } } 

So it seems it's not finding anything, but strangely it works fine on regex101.

Kyubeh2435436

1 Answer


The HTML the page actually returns doesn't match your regex. Don't capture parts you don't need; match only the attribute you want. Try:

preg_match_all('/title="South Park \(1997\) Poster"\s*src="([^"]+)"/m', 
    $result, 
    $matches);

var_dump($matches);
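
As a quick check of the pattern above, here is a minimal sketch run against a made-up HTML fragment. The fragment and the example.com URL are assumptions for illustration, not the real page markup. It also tests the match count, since preg_match_all() returns 0 (not false) when nothing matches, which is why the `!== false` check in the question passes even with empty results.

```php
<?php
// Hypothetical fragment standing in for the fetched page; the real markup may differ.
$result = '<td rowspan="2" id="img_primary"><div class="image">'
        . '<a href="/media/example"><img alt="South Park (1997) Poster" '
        . 'title="South Park (1997) Poster" src="http://example.com/poster.jpg" '
        . 'itemprop="image" height="317" width="214"></a></div></td>';

// preg_match_all() returns the number of full matches (0 if none) and
// false only on error, so test the count instead of comparing with false.
$count = preg_match_all(
    '/title="South Park \(1997\) Poster"\s*src="([^"]+)"/',
    $result,
    $matches
);

if ($count > 0) {
    // $matches[1] holds every value captured by the first group.
    foreach ($matches[1] as $src) {
        echo $src, PHP_EOL;
    }
} else {
    echo "No poster found", PHP_EOL;
}
```

Iterating over `$matches[1]` rather than over `$matches` itself avoids the index mix-up in the question's loop, where each `$match` is a whole capture group rather than one match.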

And you're done. IMHO the best way to scrape pages is to use Perl.
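
If the title attribute can't be relied on, one alternative is an HTML parser that keys on the `id="img_primary"` cell from the question's pattern instead. This is only a sketch using PHP's built-in DOMDocument and DOMXPath; the sample markup and the `$posterUrl` name are assumptions, and the live page may be structured differently.

```php
<?php
// Hypothetical markup; in practice this would be the $result returned by curl_exec().
$html = '<table><tr><td rowspan="2" id="img_primary"><div class="image">'
      . '<a href="/media/example"><img alt="Poster" title="Any title" '
      . 'src="http://example.com/poster.jpg" itemprop="image" height="317" width="214">'
      . '</a></div></td></tr></table>';

// Collect libxml parse warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Select the src attribute of any <img> inside the cell with id="img_primary".
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//td[@id="img_primary"]//img/@src');

$posterUrl = $nodes->length > 0 ? $nodes->item(0)->nodeValue : null;
echo $posterUrl ?? 'No poster found', PHP_EOL;
```

Because the query depends only on the element id and the tag name, it does not care about attribute order or the text of the title.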

Tom Pimienta
  • That wouldn't work, as the title="" is different every time you load the page. In case you didn't know, I already have an answer: an HTML parser. Thanks anyway. – Kyubeh2435436 Jul 01 '15 at 23:56