I want to scrape some content from web pages.

I have the following code, but it does not work on every page.

$url1 = 'http://www.just-eat.co.uk/restaurants-tomyumgoong/menu';
$url2 = 'http://www.just-eat.co.uk/';

$curl = curl_init($url1);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);

$page = curl_exec($curl);

if (curl_errno($curl)) // check for execution errors
{
  echo 'Scraper error: ' . curl_error($curl);
  exit;
}
echo $page;
curl_close($curl);

$regex = '/<div class="responsive-header-logo">(.*?)<\/div>/s';
if (preg_match($regex, $page, $list))
  echo $list[0];
else 
  print "Not found"; 

$url1 does not work, but when I use $url2 it works like a charm.

What can I do to fix this?

Bijan
jax

3 Answers

Try simplifying the regex to just:

$regex = '/responsive-header-logo/';
trixtur

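Try this regex: /<div class="responsive-header-logo">([\s\S]*?)<\/div>/.

By default, the dot matches any character except a line break, whereas [\s\S] matches any character including line breaks. (With the /s modifier, as in the question's pattern, the dot matches line breaks too.)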
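
To see the difference concretely, here is a small check run against a made-up multi-line snippet. The $page string below is a hypothetical stand-in, not the real Just Eat markup. It compares the plain dot, the dot with /s, and [\s\S]:

```php
<?php
// Hypothetical sample markup; the real page will differ.
$page = "<div class=\"responsive-header-logo\">\n  <a href=\"/\">Logo</a>\n</div>";

$patterns = [
    'dot, no modifier' => '/<div class="responsive-header-logo">(.*?)<\/div>/',
    'dot with /s'      => '/<div class="responsive-header-logo">(.*?)<\/div>/s',
    '[\s\S]'           => '/<div class="responsive-header-logo">([\s\S]*?)<\/div>/',
];

$results = [];
foreach ($patterns as $label => $regex) {
    // preg_match returns 1 on a match, 0 on no match
    $results[$label] = preg_match($regex, $page);
    echo $label, ': ', $results[$label] ? 'match' : 'no match', "\n";
}
```

If the last two patterns behave the same on your page, the line-break handling is probably not what makes $url1 fail, and it is worth checking whether the markup is present in $page at all.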

For testing regexes I'd recommend http://regexr.com/. A working version of this example is at http://regexr.com/3b56u

Martin Janeček

First of all, you shouldn't use regex to parse HTML/XML.

Instead, use a library designed for it, such as DOM or SimpleXML.

Example using DOM:

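libxml_use_internal_errors(true); // real-world HTML is rarely valid; suppress parser warnings
$dom = new DOMDocument();
$dom->loadHTML($html); // $html holds the downloaded page ($page in the question)
$finder = new DOMXPath($dom);
$classname = "responsive-header-logo";
// Select every element whose class attribute contains the class name
$nodes = $finder->query("//*[contains(@class, '$classname')]");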

Then call $dom->saveHTML($node) on each matching node to extract its HTML code.
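
Putting those steps together, a minimal end-to-end sketch might look like the following. The $html string is hypothetical sample markup standing in for the page fetched with cURL, and the stricter XPath expression (matching the class as a whole word rather than a substring) is my own variation, not something the question requires:

```php
<?php
// Hypothetical sample markup; in the question this would be $page from curl_exec().
$html = '<html><body>'
      . '<div class="responsive-header-logo"><a href="/">Logo</a></div>'
      . '<div class="other">x</div>'
      . '</body></html>';

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$finder = new DOMXPath($dom);
$classname = 'responsive-header-logo';
// Match the class as a whole token, so "responsive-header-logo-small" would not match.
$nodes = $finder->query(
    "//div[contains(concat(' ', normalize-space(@class), ' '), ' $classname ')]"
);

$fragments = [];
foreach ($nodes as $node) {
    // saveHTML() with a node argument serializes just that node and its children
    $fragments[] = $dom->saveHTML($node);
}

echo $fragments ? implode("\n", $fragments) : 'Not found', "\n";
```

If this prints "Not found" for $url1, the div is most likely missing from the response itself, so the next step would be to inspect what cURL actually returned.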

See: How should I get a div's content like this using dom in php?

Community
kenorb