I'm trying to extract some data from Google Shopping (I know it's against the ToS, but there's no other way to get the data) and I just can't get it to work. I first tried the simple_html_dom class without success, then switched to XPath as that's apparently better. Neither has worked :-(
Here's the page I'm trying to extract data from:
https://www.google.co.uk/search?output=search&tbm=shop&q=PCWorld.co.uk
Each product is held in an li with the class psli, so I'd like to select those and loop through them, extracting the information I need. That information sits in a series of nested divs, which I planned to read with a second XPath query run relative to each li.
However, this doesn't work. Here's the code I'm using to get the contents of the li elements:
// Fetch the Google Shopping results page using cURL
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'https://www.google.co.uk/search?output=search&tbm=shop&q=PCWorld.co.uk');
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_HEADER, false);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 120);
curl_setopt($ch, CURLOPT_TIMEOUT, 120);
$data = curl_exec($ch);
echo $data; // debug output: confirms the HTML was fetched

// Parse the HTML, collecting libxml warnings instead of printing them
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($data);
libxml_clear_errors();
libxml_use_internal_errors(false);
$xpath = new DOMXPath($dom);

// Select each product's li, then read the link inside it
$query = $xpath->query('//li[@class="psli"]');
$length = $query->length;
echo "Length: $length<br />";
foreach ($query as $node) {
    $q = $xpath->query('.//div[@class="pslicont"]/div[@class="pslimg"]/div[@class="overlay-container"]/a', $node);
    if ($q->length > 0) { // skip products where no link was found
        echo $q->item(0)->getAttribute('href');
    }
}
The query always comes back empty ($data shows the page HTML, so I know the fetch is working), but if I change my query to find the tip of the branch:
$query = $xpath->query('//h3[@class="r"]');
then the length of $query is 20, which is the number of products on the page, so I know it's getting something, even if it throws up errors about not being able to find the href values.
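Since the h3 query does match, one way to see which containers the fetched HTML really has is to walk up from a matched h3 and print each ancestor's tag and class. This is only a sketch: the markup in $html and the helper name ancestorTrail are made up for illustration, and the real page may look quite different from what the browser shows.

```php
<?php
// Hypothetical fragment standing in for the fetched page; the real markup may differ.
$html = '<html><body><ol><li class="g psli"><div class="pslicont">'
      . '<h3 class="r"><a href="/shopping/product/1">Item</a></h3>'
      . '</div></li></ol></body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parse warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Hypothetical helper: returns "tag[class]" for every element ancestor
// of the first h3 with class "r", innermost first.
function ancestorTrail(DOMXPath $xpath): array
{
    $trail = [];
    $h3 = $xpath->query('//h3[@class="r"]')->item(0);
    if ($h3 === null) {
        return $trail; // nothing matched, so there is nothing to walk
    }
    for ($n = $h3->parentNode; $n instanceof DOMElement; $n = $n->parentNode) {
        $trail[] = $n->nodeName . '[' . $n->getAttribute('class') . ']';
    }
    return $trail;
}

$trail = ancestorTrail($xpath);
echo implode(' < ', $trail), "\n";
```

Running the same loop against the DOMDocument built from $data would show whether an li is among the ancestors at all, and what its class attribute actually contains.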
What I can't figure out is how to select all of the li elements rather than the very tip of each one (the h3). I'm sure this is just down to some stupid typo, but I have literally spent the entire day on this, read a hundred articles and gone around in circles.
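One thing worth ruling out: @class="psli" only matches when the attribute is exactly that string. If the li elements carry more than one class (an assumption, not something confirmed for this page), the usual XPath 1.0 idiom is to match the class as a whitespace-delimited token instead. A minimal sketch on made-up markup, where every class name other than psli is invented:

```php
<?php
// Hypothetical markup; only the "psli" class name comes from the page in question.
$html = '<ul>'
      . '<li class="psli"><a href="/a">A</a></li>'
      . '<li class="g psli"><a href="/b">B</a></li>'
      . '<li class="pslink"><a href="/c">C</a></li>'
      . '</ul>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parse warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// Exact comparison: the whole attribute value must equal "psli".
$exact = $xpath->query('//li[@class="psli"]');

// Token comparison: pad the normalised class list with spaces and look for " psli ".
$token = $xpath->query('//li[contains(concat(" ", normalize-space(@class), " "), " psli ")]');

// Run a second query relative to each matched li to pull out its link.
$hrefs = [];
foreach ($token as $li) {
    $a = $xpath->query('.//a', $li)->item(0);
    if ($a instanceof DOMElement) {
        $hrefs[] = $a->getAttribute('href');
    }
}

echo "exact: {$exact->length}, token: {$token->length}\n";
echo implode(', ', $hrefs), "\n";
```

The padding with spaces is what stops a class such as pslink from matching, which a bare contains(@class, "psli") would let through.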
Where am I going wrong?