
I want to scrape an HTML page using cURL in PHP. I can already extract the content of a specific <div>, for example:

<div class="someDiv">ABC</div>

This works with the following code:

<?php

    $curl = curl_init('https://www.someUrl.com');
    curl_setopt_array($curl, array(
        CURLOPT_ENCODING       => '',      // accept every encoding cURL supports
        CURLOPT_FOLLOWLOCATION => FALSE,   // do not follow redirects
        CURLOPT_FRESH_CONNECT  => TRUE,
        CURLOPT_SSL_VERIFYPEER => FALSE,   // note: disables certificate verification
        CURLOPT_REFERER        => 'http://www.google.com',
        CURLOPT_RETURNTRANSFER => TRUE,    // return the page as a string instead of printing it
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)',
        CURLOPT_VERBOSE        => FALSE,
    ));
    $page = curl_exec($curl);
    if(curl_errno($curl))
    {
        echo 'Scraper error: ' . curl_error($curl);
        exit;
    }
    curl_close($curl);

    $regex = '/<div class="someDiv">(.*?)<\/div>/s';

    if (preg_match_all($regex, $page, $result)){
        echo $result[1][0];
    }
    else{ 
        print "Not found"; 
    }
?>

Now I want to scrape an <img> nested inside a <span>. The markup I want to scrape looks like this:

<span class="thumbnail">
    <img src="image.gif" width="20" data-thumb="blabla/photo.jpg" height="20" alt="abc" >
</span>

I want to get the value of the data-thumb attribute from the <img> tag nested inside the <span> with class="thumbnail".

MrShadow

1 Answer


Here we go again... don't use regex to parse HTML; use an HTML parser such as DOMDocument together with DOMXPath, e.g.:

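<?php
// ... (cURL setup as in the question)
$page = curl_exec($curl);
libxml_use_internal_errors(true); // real-world HTML is rarely valid; keep parser warnings quiet
$dom = new DOMDocument();
$dom->loadHTML($page);
$xpath = new DOMXPath($dom);
// Select every <img> that is a direct child of <span class="thumbnail">
foreach ($xpath->query("//span[@class='thumbnail']/img") as $img){
    echo $img->getAttribute('data-thumb');
}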
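
To try the DOMDocument/DOMXPath approach without making a network request, here is a minimal self-contained sketch. It assumes the markup from the question is already in a string; `extractThumbs` is a hypothetical helper name used only for illustration, and the `[@data-thumb]` filter is an addition that skips images lacking the attribute.

```php
<?php
// Sample markup copied from the question; in real use this would be
// the string returned by curl_exec().
$page = <<<HTML
<html><body>
<span class="thumbnail">
    <img src="image.gif" width="20" data-thumb="blabla/photo.jpg" height="20" alt="abc" >
</span>
</body></html>
HTML;

// Hypothetical helper: returns the data-thumb value of every <img>
// that is a direct child of <span class="thumbnail">.
function extractThumbs(string $html): array
{
    // Collect libxml warnings internally instead of emitting them,
    // then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath  = new DOMXPath($dom);
    $thumbs = [];
    foreach ($xpath->query("//span[@class='thumbnail']/img[@data-thumb]") as $img) {
        $thumbs[] = $img->getAttribute('data-thumb');
    }
    return $thumbs;
}

$thumbs = extractThumbs($page);
echo implode(PHP_EOL, $thumbs), PHP_EOL; // prints "blabla/photo.jpg"
```

Note that `@class='thumbnail'` only matches when the class attribute is exactly `thumbnail`. If the real page puts several classes on the <span>, a common XPath 1.0 workaround is `//span[contains(concat(' ', normalize-space(@class), ' '), ' thumbnail ')]/img`.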
Pedro Lobito