
A little new to PHP parsing here, but I can't seem to get PHP's DOMDocument to return what is clearly an identifiable node. The HTML loaded will come from the 'net so can't necessarily guarantee XML compliance, but I try the following:

<?php
header("Content-Type: text/plain");

$html = '<html><body>Hello <b id="bid">World</b>.</body></html>';

$dom = new DOMDocument;
$dom->preserveWhiteSpace = false;
$dom->validateOnParse = true;

/*** load the html into the object ***/
$dom->loadHTML($html);
var_dump($dom);    
    
$belement = $dom->getElementById("bid");
var_dump($belement);

?>

Though I receive no error, I only receive the following as output:

object(DOMDocument)#1 (0) {
}
NULL

Should I not be able to look up the <b> tag as it does indeed have an id?

Syscall
Jé Queue

2 Answers


The Manual explains why:

For this function to work, you will need either to set some ID attributes with DOMElement->setIdAttribute() or a DTD which defines an attribute to be of type ID. In the later case, you will need to validate your document with DOMDocument->validate() or DOMDocument->validateOnParse before using this function.
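The first route the manual mentions, marking ID attributes by hand, could be sketched roughly as follows. This is only an illustration under assumptions: it walks every element with `getElementsByTagName('*')` and flags any attribute literally named `id`, which is not something the manual prescribes.

```php
<?php
// Sketch: register "id" attributes manually so getElementById() can find them.
$html = '<html><body>Hello <b id="bid">World</b>.</body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);

// Visit every element and, if it carries an "id" attribute,
// tell the DOM that this attribute is of type ID.
foreach ($dom->getElementsByTagName('*') as $node) {
    if ($node->hasAttribute('id')) {
        $node->setIdAttribute('id', true);
    }
}

$belement = $dom->getElementById('bid');
echo $belement !== null ? $belement->nodeValue : 'not found';
```

This avoids needing a DTD at all, at the cost of one pass over the document.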

By all means, go for valid HTML & provide a DTD.

Quick fixes:

  1. Call $dom->validate(); and put up with the errors (or fix them). Afterwards you can use $dom->getElementById(), regardless of the errors, for some reason.
  2. Use XPath if you don't feel like validating: $x = new DOMXPath($dom); $el = $x->query("//*[@id='bid']")->item(0);
  3. Come to think of it: if you just set validateOnParse to true before loading the HTML, it would also work ;P


$dom = new DOMDocument();
$html ='<html>
<body>Hello <b id="bid">World</b>.</body>
</html>';
$dom->validateOnParse = true; //<!-- this first
$dom->loadHTML($html);        //'cause 'load' == 'parse

$dom->preserveWhiteSpace = false;

$belement = $dom->getElementById("bid");
echo $belement->nodeValue;

Outputs 'World' here.
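Since getElementById() may behave differently depending on the libxml version, one option is to wrap both lookups in a small helper. The sketch below is hypothetical: `findById` is a made-up name, not part of the DOM extension, and the typed signature assumes PHP 7.1 or later.

```php
<?php
// Hypothetical helper: try getElementById() first, then fall back to XPath.
function findById(DOMDocument $dom, string $id): ?DOMElement
{
    $el = $dom->getElementById($id);
    if ($el !== null) {
        return $el;
    }

    // XPath 1.0 has no string escaping, so refuse ids containing a double quote.
    if (strpos($id, '"') !== false) {
        return null;
    }

    $xpath = new DOMXPath($dom);
    $node = $xpath->query('//*[@id="' . $id . '"]')->item(0);

    return $node instanceof DOMElement ? $node : null;
}

$dom = new DOMDocument();
$dom->loadHTML('<html><body>Hello <b id="bid">World</b>.</body></html>');

$found = findById($dom, 'bid');
echo $found !== null ? $found->nodeValue : 'not found';
```

The XPath branch only runs when the ID lookup returns nothing, so the common case stays cheap.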

Wrikken
  • I do have validateOnParse. setIdAttribute only would apply to set and then subsequent retrieve? Again though, the HTML will be web-provided so I'm at their mercy, but just trying an example. HTML5 doesn't even have a DTD, yes? – Jé Queue Aug 02 '10 at 21:54
  • "setIdAttribute only would apply to set and then subsequent retrieve?" -> Yes. HTML5 is not finished yet so it should not have a DTD yet. – MartyIX Aug 02 '10 at 21:59
  • DTD would be ` `, but just calling `$dom->validate()` would also work. Put up with the errors or try to generate valid HTML (the latter is more difficult than it seems... :) ) – Wrikken Aug 02 '10 at 21:59
  • @Xepoch I've never managed to get `getElementById` working when using `DOM` with HTML. But you can substitute `getElementById` with an XPath like `//p[@id="foo"]` – Gordon Aug 02 '10 at 22:00
  • @Wrikken doesn't work for me. I'm getting *Trying to get property of non-object* on the `echo` call with PHP 5.3.2 on Vista and libxml 20703 – Gordon Aug 02 '10 at 22:12
  • Hmm, here it does work, PHP 5.3.2, libxml 2.7.6 (I assume for Windows, 20703 would be 2.7.3), you could try ftp://ftp.zlatkovic.com/libxml/libxml2-2.7.6.win32.zip . Does calling `validate()` manually later on also give no results? – Wrikken Aug 02 '10 at 22:21
  • ... and if that doesn't work, have you tried using the example from http://www.php.net/manual/en/domimplementation.createdocument.php ? – Wrikken Aug 02 '10 at 22:25
  • @Wrikken Doing `validate()` only gets me a couple of errors about the `html40/loose.dtd` and the same error as before. Using the explicit DTD declaration doesn't help either. I've tried on an XP machine with 5.3.0 and libxml 20626 and nothing as well. I guess this is either a Windows thing or a libxml thing. I'll try to update it. Upvoted nonetheless though. – Gordon Aug 03 '10 at 07:30
  • @Gordon: OK, duly noted that this isn't cross-os/version behavior. Thankfully it works on my servers :) The XPath stays a failsafe fallback afaik. – Wrikken Aug 03 '10 at 08:59
  • @Wrikken after upgrading PHP to 5.3.3 which comes bundled with libxml 2.7.7, getElementById is working. – Gordon Aug 04 '10 at 12:25
  • OK, good news, nice to know life just got that little bit easier :) – Wrikken Aug 04 '10 at 13:22

Well, you should check if $dom->loadHTML($html); returns true (success) and I would try

 var_dump($belement->nodeValue);

for output to get a clue what might be wrong.
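A minimal sketch of that check might look like the following. Collecting parser messages with `libxml_use_internal_errors()` is an addition here, not something the answer above asks for, and the sample markup is the one from the question.

```php
<?php
$html = '<html><body>Hello <b id="bid">World</b>.</body></html>';

// Collect libxml parse messages instead of letting them surface as PHP warnings.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$ok = $dom->loadHTML($html); // true on success, false on failure

// Print whatever the parser complained about, then reset the error buffer.
foreach (libxml_get_errors() as $error) {
    echo trim($error->message), "\n";
}
libxml_clear_errors();

// Only dereference the element once we know the lookup returned a node.
$belement = $ok ? $dom->getElementById('bid') : null;
var_dump($belement === null ? null : $belement->nodeValue);
```

Guarding against a null result also avoids the "Trying to get property of non-object" notice when the lookup fails.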

EDIT: According to http://www.php-editors.com/php_manual/function.domdocument-get-element-by-id.html (the manual page for the older PHP 4 DOM XML extension), it seems that DOMDocument uses XPath internally there.

Example:

// PHP 5+ equivalent using DOMXPath; loadHTML() lowercases attribute names, so match @id
$xpath = new DOMXPath($dom);
var_dump($xpath->query("//*[@id='YOURIDGOESHERE']")->item(0));
Syscall
MartyIX