
I'm trying to select content from an HTML page. The problem is that the DOMDocument object I get after loading the HTML appears to have no elements in its [documentElement] node, even though all of the page's text (without the HTML tags) shows up in [textContent]. This is how I create the DOMDocument object:

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$html = file_get_contents("https://example.com");
$doc->loadHTML($html);

And this is the outputted object:

DOMDocument Object (
[doctype]=> (object value omitted)
[implementation]=> (object value omitted)
[documentElement]=> (object value omitted)
[actualEncoding]=> utf-8
[encoding]=> utf-8
[xmlEncoding]=> utf-8
[standalone]=> 1
[xmlStandalone]=> 1
[version]=>
[xmlVersion]=>
[strictErrorChecking]=> 1
[documentURI]=>
[config]=>
[formatOutput]=>
[validateOnParse]=>
[resolveExternals]=>
[preserveWhiteSpace]=> 1
[recover]=>
[substituteEntities]=>
[nodeName]=> #document
[nodeValue]=>
[nodeType]=> 13
[parentNode]=>
[childNodes]=> (object value omitted)
[firstChild]=> (object value omitted)
[lastChild]=> (object value omitted)
[previousSibling]=>
[nextSibling]=>
[attributes]=>
[ownerDocument]=>
[namespaceURI]=>
[prefix]=>
[localName]=>
[baseURI]=>
[textContent]=> blah blah blah

This way I can't traverse the tags of the HTML and select specific content. Even new DOMXpath($doc) doesn't return useful content, which I assume is because DOMXpath depends on the [documentElement] node of the DOMDocument object. Here's the output of var_dump(new DOMXpath($doc)):

object(DOMXPath)#2 (1) { ["document"]=> string(22) "(object value omitted)" }

I tried both curl and file_get_contents to fetch the HTML and am confident the content is correct (I was able to reproduce the HTML page from the PHP file with print_r($html)). I've also read several answers on StackOverflow, but couldn't solve the problem.

DummyBeginner

1 Answer


print_r and DOMDocument aren't a particularly helpful combination. The information is there, it's just not displayed very well (the "(object value omitted)" is the hint).

If you instead use the DOMDocument methods saveHTML or saveXML, they will serialize the actual content for you. If you try

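$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings about real-world HTML quiet
$html = file_get_contents("https://example.com");
$doc->loadHTML($html);
echo "print_r()...";
print_r($doc);          // shows properties only, child objects are omitted
echo "saveHTML()...";
echo $doc->saveHTML();  // serializes the parsed document back to markup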

You should see the difference.
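To make the point concrete, here is a minimal sketch showing that documentElement is populated and can be walked even though print_r() hides it. The inline markup and the helper name listTags are assumptions made up for illustration; they stand in for the fetched page.

```php
<?php
// Hypothetical inline markup standing in for the downloaded page.
$html = '<html><body><div id="main"><p>Hello</p><p>World</p></div></body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // suppress parser warnings
$doc->loadHTML($html);
libxml_clear_errors();

$root = $doc->documentElement;    // the <html> element, present despite print_r()'s output

// Illustrative helper (not part of the DOM API): collect element names depth-first.
function listTags(DOMNode $node, array &$tags): void
{
    if ($node->nodeType === XML_ELEMENT_NODE) {
        $tags[] = $node->nodeName;
    }
    foreach ($node->childNodes as $child) {
        listTags($child, $tags);
    }
}

$tags = [];
listTags($root, $tags);
echo implode(' > ', $tags), "\n";
```

The same traversal works on the real page; only the source of $html changes.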

One thing to point out is that to output from a specific point (for example from an XPath result), you use

echo $doc->saveHTML( $xpResultNode );

Edit: with more specific code:

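$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings about real-world HTML quiet
$html = file_get_contents("https://example.com");
$doc->loadHTML($html);
$xp = new DOMXpath($doc);
$nodes = $xp->query('//*[@id="datacontainer"]/div[2]/table/tbody/tr[3]/td[4]/table/tr[2]/td');
// query() returns a DOMNodeList; only serialize if something matched
if ($nodes->length > 0) {
    echo $doc->saveHTML($nodes->item(0));
}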

I've altered the XPath query slightly as there isn't a tbody tag in the last level of table.
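The tbody issue can be reproduced in isolation. This is a sketch under the assumption that DOMDocument::loadHTML (which uses the libxml2 parser) does not insert tbody elements the way a browser does; the table markup and the id "t" are invented for the example.

```php
<?php
// Hypothetical markup: a table written without <tbody>, as in a lot of hand-written HTML.
$html = '<table id="t"><tr><td>first</td></tr><tr><td>second</td></tr></table>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xp = new DOMXPath($doc);

// Path as a browser's "Copy XPath" would typically give it (with tbody).
$withTbody = $xp->query('//table[@id="t"]/tbody/tr[2]/td');
// Path matching the markup as it was actually served (no tbody).
$withoutTbody = $xp->query('//table[@id="t"]/tr[2]/td');

echo "with tbody: ", $withTbody->length, " match(es)\n";
echo "without tbody: ", $withoutTbody->length, " match(es)\n";

if ($withoutTbody->length > 0) {
    echo $doc->saveHTML($withoutTbody->item(0)), "\n";
}
```

If the raw HTML does contain a tbody at some level, that step has to stay in the path, which is why checking against the fetched source rather than the browser's DOM is the safer route.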

Nigel Ren
  • Thanks. About the last tip you mentioned: so there's no need for `new DOMXpath($doc);`? In many places I've seen it treated as a required step in this process: [here](https://stackoverflow.com/a/40690725/190929), [here](https://stackoverflow.com/q/6820429/190929), [here](https://stackoverflow.com/q/43977351/190929), [here](https://stackoverflow.com/a/36342681/190929), [here](https://stackoverflow.com/q/7342816/190929) and so on. [This is the code I ran](https://pastebin.com/SuA9AXTC), but I can't access, for example, a `td` inside the HTML. – DummyBeginner Oct 22 '17 at 16:37
  • Thanks a lot. Would you please edit the URL of the site and change it to example.com? So you confirm that the DOMXpath object is mandatory for accessing an HTML tag's content? As for the altered XPath query you mentioned, I just copied the XPath with Chrome's developer tools. Why was it wrong? – DummyBeginner Oct 22 '17 at 18:03
  • Updated the URL. XPath is very useful for accessing the content, I think it was just the point about using the output of this as the input to saveHTML which I wanted to point out. As for the XPath statement - Chrome tends to add TBODY tags as it feels like it. Best is to grab the actual HTML (i.e. from the `file_get_contents`) generated and work with that. I tend to load it into Eclipse (my ide) and then try the XPath in that. – Nigel Ren Oct 22 '17 at 18:07
  • I just wonder how people extract the content and traverse through XPath object without `saveHTML()` and just with a `for` loop. Like [this](https://stackoverflow.com/a/36342681/190929) – DummyBeginner Oct 22 '17 at 18:26
  • You can use `->nodeValue` if you just want the text content of a DOMNode, or you sometimes use `evaluate` rather than `query` in XPath (https://stackoverflow.com/questions/23793816/what-is-the-difference-between-domxpathevaluate-and-domxpathquery) – Nigel Ren Oct 22 '17 at 18:32
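
As a rough illustration of the last comment, the sketch below loops over an XPath result with `->nodeValue` instead of `saveHTML()`, and uses `evaluate` for expressions that return a number or string rather than a node list. The list markup and the class name "item" are assumptions made up for the example.

```php
<?php
// Hypothetical markup used only for this illustration.
$html = '<ul><li class="item">alpha</li><li class="item">beta</li></ul>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xp = new DOMXPath($doc);

// query() returns a DOMNodeList, which can be iterated directly.
$texts = [];
foreach ($xp->query('//li[@class="item"]') as $li) {
    $texts[] = $li->nodeValue; // text content only, no tags
}
echo implode(', ', $texts), "\n";

// evaluate() can also return typed scalar results for expressions such as count() or string().
$count = $xp->evaluate('count(//li[@class="item"])'); // numeric result (float)
$first = $xp->evaluate('string(//li[1])');            // string result

echo "count: ", $count, ", first: ", $first, "\n";
```

`saveHTML($node)` is the better fit when the markup of the matched node is wanted; `nodeValue` is enough when only the text matters.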