6

I am trying to scrape some content from a website, but the code below is not working (it shows no output). Here is the code:

$url="some url";
$otherHeaders="";   //here i am using some other headers like content-type,userAgent,etc
some curl to get the webpage
...
..
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
$content=curl_exec($ch);curl_close($ch);

$page=new DOMDocument();
$xpath=new DOMXPath($page); 
$content=getXHTML($content);  // this is a Tidy function to convert bad HTML to XHTML
$page->loadHTML($content);    // it's okay till here; when I echo $page->saveHTML() the page is displayed

$path1="//body/table[4]/tbody/tr[3]/td[4]";
$path2="//body/table[4]/tbody/tr[1]/td[4]";

$item1=$xpath->query($path1);
$item2=$xpath->query($path2);

echo $item1->length;      //this shows zero 
echo $item2->length;      //this shows zero

foreach($item1 as $t)
    echo $t->nodeValue;    // doesn't show anything
foreach($item2 as $p)
    echo $p->nodeValue;    // doesn't show anything

I am sure there is something wrong with the XPath code above, but the XPath expressions themselves are correct; I have checked them with FirePath (a Firefox add-on). I know I am missing something very silly here but I can't make it out. Please help.

I have checked similar code for scraping links from Wikipedia (with different XPath expressions, of course) and it works nicely, so I don't understand why the code above does not work for the other URLs. I am cleaning the HTML content with Tidy, so I don't think there is a problem with XPath not getting valid HTML. I have checked the length of the node list after `$item1=$xpath->query($path1)` and it is 0, which means something is going wrong with `$xpath->query`, because the XPath expressions are correct as checked with FirePath.

I have modified my code a bit as pointed out and used loadXML instead of loadHTML, but this gives me the error `Entity 'nbsp' not defined in Entity`, so I used the libxml option LIBXML_NOENT to substitute entities, but the errors remain.

lovesh

5 Answers

5

Yes, you are missing something very basic: It's XHTML, so you must register (and use!) the proper namespace before you can expect to get results.

$xpath->registerNamespace('x', 'http://www.w3.org/1999/xhtml');

$path1="//x:body/x:table[4]/x:tbody/x:tr[3]/x:td[4]";
$path2="//x:body/x:table[4]/x:tbody/x:tr[1]/x:td[4]";

$item1=$xpath->query($path1);
$item2=$xpath->query($path2);
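
If the queries still return nothing after that, a quick sanity check, sketched along the lines of the checks suggested in the comments below (it assumes `$page` and `$xpath` are already set up as in the question), is to print the document's namespace and count all tables:

echo $page->documentElement->namespaceURI;   // expect http://www.w3.org/1999/xhtml for namespaced XHTML; empty for plain HTML parsing
echo $xpath->query('//x:table')->length;     // 0 here means the prefix or the namespace URI does not match the document
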
Tomalak
  • @Tomalak: When I modify my code as above it gives me an error: `Parse error: syntax error, unexpected T_VARIABLE in C:\xampp\htdocs\rtu\rtu_results.php on line 24`. Line 24 here is the line `$path1="//x:body/x:table[4]/x:tbody/x:tr[3]/x:td[4]";`. I have scraped web pages like this before from my `localhost` but never needed a namespace – lovesh May 29 '11 at 16:49
  • @Lovesh that syntax error would indicate you're missing a `;` on the previous line. – Marc B May 29 '11 at 17:09
  • @Marc: You are right, I missed a semi-colon. Thanks. @lovesh: A little more independent thinking, please. ;-) I'm sure that's not the first time you see such an error. – Tomalak May 29 '11 at 17:12
  • @Marc: I have added the semicolon to the line `$xpath->registerNamespace('x', 'http://www.w3.org/1999/xhtml')` and it doesn't give the error now, but it's not showing the output yet – lovesh May 29 '11 at 17:18
  • 1
    @lovesh: Please test with a simple `"//x:table"` as an XPath expression. If this gives you all tables in your document, then the namespace is working but your own XPath expression is wrong. If this is not working then the namespace `"http://www.w3.org/1999/xhtml"` is not the right one and you must check against your XHTML document what namespace it is actually using. – Tomalak May 29 '11 at 17:41
  • Check that your `getXHTML()` function is actually returning something. If it's mangling the input, then xpath won't help you – Marc B May 29 '11 at 17:53
  • @Marc B: the `getXHTML()` function is returning the XHTML after cleaning the HTML – lovesh May 29 '11 at 18:05
  • 1
    The XHTML may not have the correct xmlns. Check the value of `$page->documentElement->namespaceURI` and if it is not null you should pass that value into `registerNamespace()`. – cmbuckley May 30 '11 at 11:54
  • @cmbuckley: `$page->documentElement->namespaceURI` is null; I mean it is not showing any output – lovesh May 30 '11 at 16:29
  • @Tomalak: I am using this namespace as `$xpath->registerNamespace('x', 'http://www.w3.org/1999/xhtml')` and I used the XPath `//x:table`, but again `$xpath->query` is returning a node list with 0 length and the browser window is still blank. As I am converting a page from another website to XHTML by using Tidy, how can I know the namespace used by that site, and do namespaces change once the HTML goes through Tidy? – lovesh May 30 '11 at 16:56
  • @Tomalak: I have checked the namespace of the document by viewing source in Firefox; it is `http://www.w3.org/1999/xhtml`. I used this namespace in `registerNamespace` but still no output – lovesh May 30 '11 at 17:13
  • @lovesh: You do not provide enough information. Write a small-but-complete code example that reproducibly shows the error. Otherwise nobody here can help you. You should have done this from the start. – Tomalak May 30 '11 at 19:15
  • @Tomalak: OK, I will show you the complete code. Actually it's like submitting a value to a form and then I have to scrape the content that is generated dynamically. I will also provide the URLs that I am scraping, too – lovesh May 30 '11 at 22:31
4

It seems that the problem is somehow related to XPath and namespaces. The PHP manual revealed an interesting user comment:

If you've registered your namespaces, loaded your XHTML, etc., into your XPath's DOMDocument object and still can't get it to work, check to make sure you haven't used the DOMDocument's loadHTML() or loadHTMLFile() function. For XHTML always use the XML versions, otherwise your XPath will never, ever work.

Your code uses loadHTML()

$content=getXHTML($content);  // this is a Tidy function to convert bad HTML to XHTML
$page->loadHTML($content);    // it's okay till here; when I echo $page->saveHTML() the page is displayed

HTML is not namespace aware so loadHTML() might not set the namespaces on the elements of the document object even though the original document (or the XHTML outputted by Tidy) had them.

Because you use Tidy to convert the document to XHTML, I guess you could safely use loadXML() without running into parsing errors. Note that it requires the input to be well-formed XML. Also, it might not be aware of HTML's predefined entities like `&nbsp;`, and in that case it can't replace the entities with their correct character values. If such a problem arises, try setting different options for loadXML().
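
One workaround along those lines (a sketch; it assumes the cleanup is done with PHP's `tidy` extension, which may differ from what the OP's `getXHTML()` does) is to have Tidy emit numeric entities, so that loadXML() never encounters `&nbsp;`:

$config = array(
    'output-xhtml'     => true,   // produce XHTML
    'numeric-entities' => true,   // write &#160; instead of &nbsp;
);
$tidy = new tidy();
$content = $tidy->repairString($content, $config, 'utf8');

$page = new DOMDocument();
libxml_use_internal_errors(true); // collect parse errors instead of printing warnings
$page->loadXML($content);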

jasso
  • +1 Recommended in private e-mail. Should have followed it up here, but thanks for adding the user comment. – cmbuckley May 31 '11 at 11:32
  • Thanks for this. You are right, using `loadXML` gives errors: `Entity 'nbsp' not defined in Entity, line: 212 in filename on line 10`, where line 10 is the line with loadXML. I tried using options for `loadXML` like `$page->loadXML($content,LIBXML_NOENT);` for substituting entities, but the errors remain. Can you tell me which option or combination of options can make this work? – lovesh May 31 '11 at 14:06
  • @lovesh: Sorry, I'm not familiar with those options. Another possibility to fix entity problems is to check whether Tidy can do the entity replacement. – jasso Jun 01 '11 at 15:01
2

I have heard that Firefox adds a `tbody` element if one isn't present.

In addition to or independently of @Tomalak's advice, try the XPath expressions with the /tbody location step removed.
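
For example, dropping the `tbody` location step turns the expressions from the question into:

$path1="//body/table[4]/tr[3]/td[4]";
$path2="//body/table[4]/tr[1]/td[4]";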

Also, use another tool, such as the XPath Visualizer, to construct correct XPath expressions and see immediately what they select.

Dimitre Novatchev
  • @Dimitre Novatchev: I tried your suggestion but it gives an error: `Parse error: syntax error, unexpected T_VARIABLE in C:\xampp\htdocs\rtu\rtu_results.php on line 27`, where line 27 is `$path1="//body/table[4]/tr[3]/td[4]";` – lovesh May 29 '11 at 16:56
  • @Dimitre Novatchev: I tried the XPath with Google Chrome but I get the same error – lovesh May 29 '11 at 16:58
  • @lovesh: The XPath expression is syntactically correct -- the syntax error should be in your PHP statement. – Dimitre Novatchev May 29 '11 at 17:00
  • @Dimitre Novatchev: I think the PHP is correct too, because I have scraped content from other pages in the same way, but from `localhost`, that is, my own web server; I used to save pages to my disk first. Do you have any other suggestions? Could you perhaps forward my question to someone who can help? – lovesh May 29 '11 at 17:06
  • @lovesh: I don't know anything about PHP. As far as XPath is concerned, I have answered your question completely. You can test the proposed solution on a static XML document to verify it is working. If this is confirmed, then the problem is in your dynamic access to the document, and I recommend that you ask a different question using appropriate question tags specifically about this problem. I would be glad to further assist with any XPath-related issues. – Dimitre Novatchev May 29 '11 at 17:27
  • @Dimitre Novatchev: Thanks for the help – lovesh May 29 '11 at 17:43
  • @Dimitre Novatchev: Can I post this same question again, or might somebody close it as a duplicate? Is there some other way to bring people's attention to this question? – lovesh May 29 '11 at 17:51
  • 1
    @lovesh: Why should you post the question again? Better edit it and add new, relevant information. For example, provide a sample XML file -- as minimal as possible. Then many people will be able to help. – Dimitre Novatchev May 29 '11 at 18:00
  • @Lasse V. Karlsen: Can I do something else to get more people's attention to this question? I have been stuck on this for long. I have modified my question to include some additional tests that I did – lovesh May 29 '11 at 20:53
  • I don't know much, if anything, about PHP, but you're saying it does not work, and it generally helps if you say what happens, what you expected, and what you tried to fix the problem. Just saying "it doesn't work" is understood by many as too vague to bother helping you with. In this case, did you try changing the xpath queries gradually up from nothing to what you have in your question, to figure out which particular node(s) you're not locating? – Lasse V. Karlsen May 29 '11 at 20:58
  • @Lasse V. Karlsen: As I have mentioned in the post above, there is no output, and I have tried implementing the above suggestions but nothing works. I expect to see the data from the selected fields; this code works for Wikipedia but it is not working for these other URLs. – lovesh May 29 '11 at 21:03
1

This question reminds me that a lot of the time the solution to a problem lies in simplicity, not complication. I was trying namespaces, error corrections, etc., but the solution just demanded close inspection of the code. The problem with my code was the order of loadHTML() and the XPath initialization. Initially the order was

$xpath=new DOMXPath($page);
$page->loadHTML($content);

By doing this I was actually initializing the XPath object on an empty document. Reversing the order, first loading the DOM with the HTML and then initializing the XPath, gave me the desired results. Also, as suggested, I removed the `tbody` element from the XPath, since Firefox inserts it automatically. So the correct XPath expressions are

$path1="//body/table[4]/tr[3]/td[4]";
$path2="//body/table[4]/tr[1]/td[4]";
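
Putting the fix together, a minimal corrected sketch (it assumes `$content` already holds the tidied markup returned by `getXHTML()`):

$page = new DOMDocument();
$page->loadHTML($content);      // load the document first...
$xpath = new DOMXPath($page);   // ...then create the XPath object on the populated DOM

$item1 = $xpath->query($path1);
foreach ($item1 as $t) {
    echo $t->nodeValue;
}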

Thanks to everyone for their suggestions and for bearing with this.

lovesh
0

(Try the following both in combination with and separately from the other answers, as they are other possible caveats.)

If your XPath isn't working, try applying just parts of it to make sure you are indeed following the right path. So do something like:

$path1="//body";
$item1 = $xpath->query($path1);

foreach ($item1 as $t) {
    // to see the full XML of the returned node, as the nodeValue may be empty
    echo $t->ownerDocument->saveXML($t); 
}

Then keep increasing your XPath to the location you want.

Also, if you find that nodeValue and textContent of your nodes are empty, you should make sure that you are loading into the DOMDocument with the correct encoding (e.g. if the cURL response returns UTF-8, you'll need to pass 'UTF-8' as the second parameter when constructing your DOMDocument).
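
A sketch of that encoding hint (it assumes the response really is UTF-8; substitute whatever charset the server declares):

$page = new DOMDocument('1.0', 'UTF-8');   // encoding passed as the second constructor argument
$page->loadHTML($content);
// if values still come back garbled, converting first with mb_convert_encoding($content, 'HTML-ENTITIES', 'UTF-8') is another commonly used workaround
$xpath = new DOMXPath($page);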

cmbuckley
  • I tried your suggestion but it is not showing any output. Now I am absolutely sure where the problem is: `$xpath->query($path1);` is not getting the XPath. Can you imagine why? – lovesh May 30 '11 at 16:45
  • The DOMDocument is being loaded properly, as I have checked with `$page->saveHTML()`; it is displaying the page in the browser – lovesh May 30 '11 at 22:28
  • How about instead of using XPaths for testing, you check the element returned by `$page->getElementsByTagName('body')->item(0)`? You can keep following the path in the same way by chaining those methods. – cmbuckley May 30 '11 at 23:28
  • How do I find the encoding of the cURL response? – lovesh May 31 '11 at 20:39
  • It will (hopefully) be in the `Content-Type` response header. You'll need to do something like `curl_setopt($ch, CURLOPT_HEADER, 1);` and then split the headers from the body with `list($header, $body) = explode("\r\n\r\n", $content, 2);`. Have a look at http://www.sitepoint.com/forums/php-34/getting-response-header-php-curl-request-590248.html for more info. – cmbuckley May 31 '11 at 22:11
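
Roughly what that last comment describes, as a sketch (it assumes `$ch` is the cURL handle from the question and that the server sends a single, simple header block):

curl_setopt($ch, CURLOPT_HEADER, 1);                        // include the response headers in the output
$response = curl_exec($ch);
list($header, $body) = explode("\r\n\r\n", $response, 2);   // split headers from body
if (preg_match('/charset=([\w-]+)/i', $header, $m)) {
    echo $m[1];                                             // e.g. UTF-8 or ISO-8859-1
}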