I want to scrape an HTML page, and I am using cURL in PHP to do it.
I can successfully scrape the content of a specific <div>, i.e.
<div class="someDiv">ABC</div>
with the following working code:
<?php
$curl = curl_init('https://www.someUrl.com');
curl_setopt_array($curl, array(
    CURLOPT_ENCODING => '',
    CURLOPT_FOLLOWLOCATION => FALSE,
    CURLOPT_FRESH_CONNECT => TRUE,
    CURLOPT_SSL_VERIFYPEER => FALSE,
    CURLOPT_REFERER => 'http://www.google.com',
    CURLOPT_RETURNTRANSFER => TRUE,
    CURLOPT_USERAGENT => 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)',
    CURLOPT_VERBOSE => FALSE,
));
$page = curl_exec($curl);
if (curl_errno($curl)) {
    echo 'Scraper error: ' . curl_error($curl);
    exit;
}
curl_close($curl);
$regex = '/<div class="someDiv">(.*?)<\/div>/s';
if (preg_match_all($regex, $page, $result)) {
    echo $result[1][0];
} else {
    print "Not found";
}
?>
Now I want to scrape an <img> nested inside a <span>. The markup I want to scrape is as follows:
<span class="thumbnail">
<img src="image.gif" width="20" data-thumb="blabla/photo.jpg" height="20" alt="abc" >
</span>
I want to get the value of the data-thumb attribute from the <img> tag nested inside the <span> that has class="thumbnail".
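For illustration, here is a rough sketch of one way this could be approached, using PHP's bundled DOMDocument and DOMXPath classes rather than a regex. The function name extractThumbs is hypothetical (it is not part of my code above), and the sample string is just the snippet from this question; in practice $page from curl_exec() would be passed in instead.

```php
<?php
// Hypothetical helper (an assumption for illustration): returns the data-thumb
// value of every <img> found inside a <span> whose class list contains "thumbnail".
function extractThumbs(string $html): array
{
    $dom = new DOMDocument();
    // Scraped HTML is rarely valid, so collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Match "thumbnail" as a whole class token, even if the span has several classes.
    $query = '//span[contains(concat(" ", normalize-space(@class), " "), " thumbnail ")]'
           . '//img[@data-thumb]';

    $thumbs = [];
    foreach ($xpath->query($query) as $img) {
        $thumbs[] = $img->getAttribute('data-thumb');
    }
    return $thumbs;
}

// Sample markup copied from the question; replace with $page in the real scraper.
$sample = '<span class="thumbnail">
<img src="image.gif" width="20" data-thumb="blabla/photo.jpg" height="20" alt="abc" >
</span>';

$thumbs = extractThumbs($sample);
echo $thumbs[0] ?? 'Not found';
```

For the sample markup this should print blabla/photo.jpg. A DOM parser is used here on the assumption that attribute order and whitespace inside the <img> tag may vary between pages, which is hard to cover reliably with a single regular expression.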