I am trying to select content from an HTML page. The problem is that the DOMDocument built from $html appears to have no elements in its [documentElement] node. However, all the text of the HTML page (excluding the HTML tags) is inside [textContent].
This is how I created the DOMDocument object:
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$html = file_get_contents("https://example.com");
$doc->loadHTML($html);
And this is the resulting object when printed:
DOMDocument Object (
[doctype]=> (object value omitted)
[implementation]=> (object value omitted)
[documentElement]=> (object value omitted)
[actualEncoding]=> utf-8
[encoding]=> utf-8
[xmlEncoding]=> utf-8
[standalone]=> 1
[xmlStandalone]=> 1
[version]=>
[xmlVersion]=>
[strictErrorChecking]=> 1
[documentURI]=>
[config]=>
[formatOutput]=>
[validateOnParse]=>
[resolveExternals]=>
[preserveWhiteSpace]=> 1
[recover]=>
[substituteEntities]=>
[nodeName]=> #document
[nodeValue]=>
[nodeType]=> 13
[parentNode]=>
[childNodes]=> (object value omitted)
[firstChild]=> (object value omitted)
[lastChild]=> (object value omitted)
[previousSibling]=>
[nextSibling]=>
[attributes]=>
[ownerDocument]=>
[namespaceURI]=>
[prefix]=>
[localName]=>
[baseURI]=>
[textContent]=> blah blah blah
)
As a result, I can't traverse the tags of the HTML and select specific content. Even new DOMXPath($doc) doesn't return useful content, which I assume is because DOMXPath depends on the [documentElement] node of the DOMDocument object. Here's the output of var_dump(new DOMXPath($doc)):
object(DOMXPath)#2 (1) { ["document"]=> string(22) "(object value omitted)" }
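One way to test the assumption above is to inspect the nodes directly instead of relying on the dump. In these dumps, "(object value omitted)" is a placeholder printed for nested objects, so it does not by itself show that [documentElement] is empty. The following is only a minimal sketch: the inline $html string and the //p query are hypothetical stand-ins for the real page and the real selection.

```php
<?php
// Hypothetical stand-in for the fetched page; the real $html would come
// from file_get_contents() or curl as in the question.
$html = '<!DOCTYPE html><html><head><title>Demo</title></head>'
      . '<body><p class="intro">Hello</p><p>World</p></body></html>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Read properties of the root element directly rather than dumping $doc.
$root = $doc->documentElement;
echo $root->nodeName, PHP_EOL;           // name of the root element
echo $root->childNodes->length, PHP_EOL; // number of direct child nodes

// Run a query through DOMXPath instead of dumping the DOMXPath object.
$xpath = new DOMXPath($doc);
$paragraphs = $xpath->query('//p');      // hypothetical example query
foreach ($paragraphs as $p) {
    echo $p->textContent, PHP_EOL;
}
```

If the root element name and the query results print as expected, the tree is populated and the dump was only hiding it.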
I tried both curl and file_get_contents to fetch the HTML, and I am confident the HTML content is correct (I was able to reproduce the HTML page from the PHP file with print_r($html)). I've also read several answers on Stack Overflow, but couldn't solve the problem.