2

I'm using QueryPath to manipulate a page's DOM. The page I'm manipulating has some tags that QueryPath doesn't know how to interpret.

I've tried passing the following as options but I still get errors:

ignore_parser_warnings
use_parser (html)

I get the following errors with these enabled:

Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: Tag nobr invalid in Entity

Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: htmlParseEntityRef: expecting ';' in Entity

Any help would be greatly appreciated.

Jon
digital
  • There is not a single reason to use the php5 tag. PHP 5 has been the current version for *six* years already. php **means** php5 and nothing else. It's php4 and php6 that require a special tag, not php5. – Your Common Sense Oct 21 '10 at 12:12

3 Answers

7

Use htmlqp() instead of qp(). The htmlqp() function does a substantial amount of fixing for yucky HTML.

Technosophos
2

Try the libxml functions

libxml_use_internal_errors(true); // collect parser errors internally instead of emitting warnings
$dom->load('whatever'); // or whatever you use for loading the DOM
libxml_clear_errors(); // discard the collected errors

Instead of just clearing the errors, you can opt to handle them, though the above should be sufficient for most cases.
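If you would rather inspect the errors than discard them, something along these lines may work. This is a minimal sketch using plain `DOMDocument`, not QueryPath; the `$html` string is a made-up sample modelled on the warnings in the question, and which errors (if any) are reported depends on the installed libxml2 version.

```php
<?php
// Hypothetical malformed input: an unknown <nobr> tag and a bare ampersand.
$html = '<html><body><nobr>Tom & Jerry</nobr></body></html>';

// Collect libxml errors internally instead of emitting PHP warnings,
// remembering the previous setting so it can be restored afterwards.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// Each entry is a LibXMLError object with level, code, line and message.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Turn the errors into readable strings, e.g. for logging.
$messages = [];
foreach ($errors as $error) {
    $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
}

print_r($messages);
```

Restoring the previous `libxml_use_internal_errors()` setting keeps the change from leaking into unrelated code that relies on the default behaviour.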

Gordon
  • Do you mean that libxml_clear_errors forces DomDocument to parse malformed html? – ymakux May 01 '15 at 13:12
  • 1
    @ymakux no, `libxml_clear_errors` will just delete the internal error buffer. If you want to parse malformed HTML, just use `loadHTML` or `loadHTMLFile`. That will make libxml use the HTML parser module, which will attempt to fix any broken markup for you. – Gordon May 01 '15 at 13:17
  • I've tried both loadHTML and loadHTMLFile. They return an empty DOM object on invalid HTML (two body tags, wrong DOCTYPE, etc.) – ymakux May 01 '15 at 13:21
  • 1
    @ymakux if you mean empty when checking with `var_dump`, then [that's to be expected until recent versions of PHP](http://stackoverflow.com/questions/4776093/why-doesnt-var-dump-work-with-domdocument-objects-while-printdom-savehtml). Echo the HTML to see whether the markup was parsed: https://eval.in/322104. libxml can parse most (not all) malformed markup. – Gordon May 01 '15 at 13:28
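
To check whether broken markup was actually parsed, serialising the document is more reliable than `var_dump`. A rough sketch, assuming a made-up `$html` string with unclosed tags:

```php
<?php
// Hypothetical broken markup: two unclosed <p> tags and no closing tags at all.
$html = '<html><body><p>first<p>second';

// Keep parser complaints out of the output while loading.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$loaded = $dom->loadHTML($html);
libxml_clear_errors();

// Serialise the tree back to a string to see what the parser recovered.
$output = $dom->saveHTML();
echo $output;
```

If the text of both paragraphs shows up in the echoed markup, the HTML parser recovered the content even though the input was malformed.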
-1

Just put an @ in front of your QueryPath function calls to suppress the warnings. While invalid HTML may generate warnings, QueryPath can generally handle it just fine.

MarathonStudios
  • And of course, getting no output because an exception was thrown, then wondering why your code doesn't work anymore because it no longer reports errors, sounds like a marvellous waste of time. – Christopher Thomas May 14 '13 at 10:36