How to extract specific span content from a webpage

Question

I am trying to scrape some content from webpages.

I have the following code but it does not work on every page.

$url1 = 'http://www.just-eat.co.uk/restaurants-tomyumgoong/menu';
$url2 = 'http://www.just-eat.co.uk/';

$curl = curl_init($url1);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);

$page = curl_exec($curl);

if (curl_errno($curl)) // check for execution errors
{
  echo 'Scraper error: ' . curl_error($curl);
  exit;
}
echo $page;
curl_close($curl);

$regex = '/<div class="responsive-header-logo">(.*?)<\/div>/s';
if (preg_match($regex, $page, $list))
  echo $list[0];
else 
  print "Not found";

With $url1 the regex does not match and the script prints "Not found", but with $url2 it works like a charm.

What can I do to fix this?

What do you mean by "not working"? – user_0 Jun 04 '15 at 17:20
It goes to the else branch and prints "Not found". – jax Jun 04 '15 at 17:23

score 0 · Answer 1 · answered Jun 04 '15 at 20:13


Try simplifying the regex to just:

$regex = '/responsive-header-logo/';
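
One way to read this suggestion is as a diagnostic: if even the bare class name is missing from the fetched HTML, the problem lies in what the server returned rather than in the rest of the pattern. A minimal sketch, assuming sample HTML strings in place of the real page; `containsClassName` is a hypothetical helper name, not something from the question:

```php
<?php
// Hypothetical helper: reports whether the class name occurs anywhere in the HTML.
function containsClassName(string $html, string $className): bool
{
    // preg_quote() escapes regex metacharacters in the class name.
    return preg_match('/' . preg_quote($className, '/') . '/', $html) === 1;
}

// Sample inputs standing in for the output of curl_exec().
$withLogo = '<div class="responsive-header-logo">...</div>';
$withoutLogo = '<div class="header"><img src="logo.png"></div>';

var_dump(containsClassName($withLogo, 'responsive-header-logo'));    // bool(true)
var_dump(containsClassName($withoutLogo, 'responsive-header-logo')); // bool(false)
```

If the check fails for the page behind $url1, it is worth inspecting the raw response before adjusting the regex further.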


trixtur


score 0 · Answer 2 · answered Jun 04 '15 at 20:24

Try this regex: /<div class="responsive-header-logo">([\s\S]*?)<\/div>/.

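By default, the dot matches any character except a line break, while [\s\S] matches any character including line breaks. Note that the s modifier already used in the question's pattern makes the dot match line breaks as well, so both forms should behave the same there.

For regex testing I'd recommend http://regexr.com/. A working example: http://regexr.com/3b56u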
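
A minimal sketch comparing the three variants on a multi-line div. The sample HTML is an assumption made up for illustration, not taken from the site in the question:

```php
<?php
// Sample markup with line breaks inside the div (an assumption for illustration).
$html = "<div class=\"responsive-header-logo\">\n  <a href=\"/\">Logo</a>\n</div>";

$patterns = [
    'dot, no modifier'    => '/<div class="responsive-header-logo">(.*?)<\/div>/',
    'dot with s modifier' => '/<div class="responsive-header-logo">(.*?)<\/div>/s',
    '[\s\S] class'        => '/<div class="responsive-header-logo">([\s\S]*?)<\/div>/',
];

$results = [];
foreach ($patterns as $label => $pattern) {
    // preg_match() returns 1 on a match and 0 when nothing matches.
    $results[$label] = preg_match($pattern, $html) === 1;
    echo $label . ': ' . ($results[$label] ? 'match' : 'no match') . PHP_EOL;
}
```

Only the first pattern fails here, because the content between the tags spans several lines.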

score 0 · Answer 3 · edited May 23 '17 at 10:27

First of all, you shouldn't use regex to parse HTML/XML.

Instead, you should use a library designed for it, such as DOM or SimpleXML.

Example using DOM:

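libxml_use_internal_errors(true); // keep warnings about malformed real-world HTML quiet
$dom = new DOMDocument();
$dom->loadHTML($html); // $html is the page source, e.g. $page from curl_exec()
$finder = new DOMXPath($dom);
$classname = "responsive-header-logo";
// select every element whose class attribute contains the class name
$nodes = $finder->query("//*[contains(@class, '$classname')]");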

Then pass each matched node to $dom->saveHTML() to get its HTML code.
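
Put together, the extraction step might look like the sketch below. It assumes a small sample document instead of the fetched page, and `$fragments` is a hypothetical variable name introduced for illustration:

```php
<?php
// Sample document standing in for the fetched page (an assumption for illustration).
$html = '<html><body><div class="responsive-header-logo"><a href="/">Logo</a></div>'
      . '<p>Other content</p></body></html>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$finder = new DOMXPath($dom);
$classname = 'responsive-header-logo';
$nodes = $finder->query("//*[contains(@class, '$classname')]");

// saveHTML() accepts a node and serializes just that subtree.
$fragments = [];
foreach ($nodes as $node) {
    $fragments[] = $dom->saveHTML($node);
}

print_r($fragments);
```

Note that contains(@class, ...) is a substring test, so it would also match a class such as responsive-header-logo-small; a stricter XPath expression may be needed if that matters.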

See: How should I get a div's content like this using dom in php?
