
I have been playing around with cURL and XPath for some web scraping. I finally got my code running the way I want, but after trying it on another site it stopped working. The only things I have changed are the XPath and the URL. I'm totally new to this and have only been working with it for a week, so bear with me if it's an obvious mistake.

My code is:

<?php
/*----Connection to Database----*/
include('wp-config.php');
mysql_connect(DB_HOST, DB_USER, DB_PASSWORD);
mysql_select_db("db");

/*----US Dollar Index----*/
$url = "http://www.wsj.com/mdc/public/page/2_3023-fut_index-futures.html";
$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';

// Make the cURL request
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL,$url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html= curl_exec($ch);
if (!$html) {
 echo "<br />cURL error number:" .curl_errno($ch);
 echo "<br />cURL error:" . curl_error($ch);
 exit;
}

// Parse the html into a DOMDocument
$dom = new DOMDocument();
@$dom->loadHTML($html);

// Grab all the MONTH on the page
$xpath = new DOMXPath($dom);

$data = $xpath->query("/html/body/div[6]/div[3]/div/table[9]/tbody/tr[position() >= 3 and position() <=6]");

//[position() >= 1 and position() <=13]

// Searching for data
$values = array();
foreach($data as $row) {
 $values[] = $row->nodeValue;
}

print_r($values);

?>
</body>
</html>
Lars Larsen
  • When you say it stopped, does that mean the script timed out, returned no content, there was an error, etc.? – Rasclatt Nov 08 '15 at 15:27
  • Sorry for not providing that information. The script didn't time out or return an error. The only thing being displayed is "Array( )" – Lars Larsen Nov 08 '15 at 15:33
  • what do you mean you "changed path and url"? To what did you change it? the xpath you have is only valid for the url in your code... – drkthng Nov 08 '15 at 16:33
  • Yes, I know, and that is why I have changed both the URL and the XPath. It works on another site with a different URL and the corresponding XPath. I follow the same procedure as on the other site, so there must be something I'm doing wrong. – Lars Larsen Nov 08 '15 at 16:42
  • if you change the url to: http://www.cmegroup.com/trading/agricultural/grain-and-oilseed/wheat_quotes_settlements_futures.html and the xpath to `/html/body/div[1]/div[2]/div[1]/div[2]/div[1]/div[4]/div[2]/div[3]/table/tbody/tr[position() >= 1 and position() <=14]` it works. – Lars Larsen Nov 08 '15 at 16:42

2 Answers


A few things come to mind. Have you checked what the incoming HTML looks like, and whether it contains something that doesn't belong there? And is the XPath you're using correct? At least in this older answer, it seems that the range should be given in the form

[position() >= 100 and not(position() > 200)]

https://stackoverflow.com/a/3355022/5526468

Edit: Now that I think of it, if the actual HTML contains fewer items than the range asks for, maybe the XPath evaluates the range expression as false, so the query finds nothing?
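
The worry in the edit can be checked in isolation. The following is a minimal sketch, assuming a hypothetical four-row table rather than the real WSJ page: in XPath 1.0 the predicate is evaluated once per candidate node, so asking for positions 3 to 6 when only four rows exist should still return rows 3 and 4 rather than nothing.

```php
<?php
// Hypothetical sample markup: a table with only four rows.
$html = '<html><body><table>'
      . '<tr><td>r1</td></tr><tr><td>r2</td></tr>'
      . '<tr><td>r3</td></tr><tr><td>r4</td></tr>'
      . '</table></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Ask for rows 3..6 even though only 4 rows exist.
$rows = $xpath->query('//table/tr[position() >= 3 and position() <= 6]');

$values = array();
foreach ($rows as $row) {
    $values[] = trim($row->nodeValue);
}

print_r($values); // rows r3 and r4
```

If this holds, an empty result is more likely caused by a path step that does not match the parsed document than by the range itself.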

Oskari3000
  • I used Firebug in Firefox to copy the XPath directly from the browser, so I don't know how I could have gotten it wrong. I tried applying your suggestion to the XPath and it doesn't work either. I still think it says the same as mine, but I'm not sure :) I have tried different kinds of XPath, using a direct path to one item instead of the range, and it is still not working. – Lars Larsen Nov 09 '15 at 07:30
  • I tried your code, and it seems that it can't parse the html it is receiving. The $html variable gets filled with some sort of html (or at least html-looking data), but when you try to load it with $dom->loadHTML($html), the $dom will not parse it successfully. I think your original code and xpath query are correct, but the problem is in the web page you're trying to parse. Maybe Firefox/Firebug fixes the html on the fly, and therefore you are able to get a correct Xpath when using Firefox. – Oskari3000 Nov 09 '15 at 15:51
  • Okay, that sounds like it could be the problem. What could I do to resolve it, so that I get the right HTML passed into my variable? – Lars Larsen Nov 10 '15 at 07:42
  • Well, I haven't done that very often, and in fact parsing HTML is quite a rare situation for me too. :) The links I found seem to suggest that loadHTML() should accept invalid HTML as well, but maybe it needs some additional switches turned on or off... maybe using http://php.net/manual/en/function.libxml-use-internal-errors.php will help you find the problem. Here are some other links that might be of some use: http://stackoverflow.com/questions/3893375/how-can-i-scrape-a-website-with-invalid-html and http://stackoverflow.com/questions/6637125/parsing-invalid-html-from-other-website-using-php – Oskari3000 Nov 10 '15 at 16:25
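
As a rough illustration of the libxml_use_internal_errors() suggestion in the last comment, here is a minimal sketch. The `$html` string is hypothetical, deliberately malformed markup standing in for whatever cURL returned. The idea is to drop the `@` in front of loadHTML(), collect the parser's complaints, and print them so it becomes visible whether the document was parsed at all.

```php
<?php
// Hypothetical malformed markup standing in for the fetched page.
$html = '<html><body><div><table><tr><td>ok</td></tr></table></span></body></html>';

// Collect parse errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$loaded = $dom->loadHTML($html);

// Each entry is a LibXMLError object with message and line properties.
foreach (libxml_get_errors() as $error) {
    echo 'line ' . $error->line . ': ' . trim($error->message) . "\n";
}
libxml_clear_errors();

// Even with parse errors, the recovered tree can usually still be queried.
$xpath = new DOMXPath($dom);
$cells = $xpath->query('//td');
echo 'cells found: ' . $cells->length . "\n";
```

If loadHTML() reports errors but the query still finds nodes, the markup was recovered and the XPath expression itself is the more likely culprit.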

I solved my problem, which was the path. The path Firebug gave me wasn't the right one for the site. Why, I don't know.
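
The answer does not say what was wrong with the path, so the following is only an assumption about one common cause: browsers insert a `<tbody>` element into every table in their DOM, and paths copied from browser tools include it, whereas the parser behind DOMDocument keeps the raw source as served. A minimal sketch with hypothetical markup:

```php
<?php
// Hypothetical raw source as served: note there is no <tbody> element.
$html = '<html><body><table>'
      . '<tr><td>a</td></tr><tr><td>b</td></tr>'
      . '</table></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Path in the style a browser tool reports (it sees the inserted tbody).
$browserStyle = $xpath->query('/html/body/table/tbody/tr');

// Same path without the tbody step, matching the raw source.
$rawStyle = $xpath->query('/html/body/table/tr');

echo $browserStyle->length . "\n"; // 0
echo $rawStyle->length . "\n";     // 2
```

A path written against the HTML that cURL actually returned, or a relative one such as `//table//tr`, avoids depending on how a browser restructures the page.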

Lars Larsen