I want to parse an HTML file.

$html = htmlentities(file_get_contents('http://forums.heroesofnewerth.com/showthread.php?553261'));
$dom = new DOMDocument();
$dom->loadHTML($html); // line 30

I'm getting these errors:

Warning: DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity, line: 113 in D:\Projects\Web projects\done\honscript\index.php on line 30

Warning: DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity, line: 113 in D:\Projects\Web projects\done\honscript\index.php on line 30

Warning: DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity, line: 200 in D:\Projects\Web projects\done\honscript\index.php on line 30

Warning: DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity, line: 200 in D:\Projects\Web projects\done\honscript\index.php on line 30

I changed it to use htmlentities and now I get:

Warning: DOMDocument::loadHTML(): Empty string supplied as input in D:\Projects\Web projects\done\honscript\index.php on line 30
George Irimiciuc

1 Answer

The document you're trying to load is not valid HTML and thus not valid DOM (see http://validator.w3.org/check?verbose=1&uri=http%3A%2F%2Fforums.heroesofnewerth.com%2Fshowthread.php%3F553261 for an extensive list of HTML errors on that page).

So PHP basically has to guess what's meant by the HTML it's provided with and warns about that (it might guess wrong).

The & is a special character in HTML: it starts a character reference, which is how special characters are escaped (for example, to print < in an HTML page you'd have to write &lt;). It also has a special meaning in URLs as a separator for request variables (e.g. http://example.com?foo=bar&braz=omfg) and thus appears a lot in websites. The correct way to write a literal & within HTML is &amp;.

Probably the guesses are correct and the DOMDocument will work just fine. So you could just suppress this warning like so:

@$dom->loadHTML($html);
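If you'd rather not silence everything with @, one alternative is to let libxml collect the parse errors internally so you can still look at them. This is only a minimal sketch: the sample markup is made up for illustration, and whether a given bare & is reported at all may depend on the libxml2 version PHP was built against.

```php
<?php
// Made-up sample markup with a bare "&" in a URL, similar to what triggers the warning.
$html = '<html><body><a href="page.php?a=1&b=2">link</a></body></html>';

// Collect libxml parse errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// Inspect what the parser complained about (if anything), then clear the buffer.
$errors = libxml_get_errors();
foreach ($errors as $error) {
    echo trim($error->message), ' (line ', $error->line, ")\n";
}
libxml_clear_errors();

// Restore the previous error-handling mode.
libxml_use_internal_errors($previous);

// The document is still usable even if the parser had to guess.
$links = $dom->getElementsByTagName('a');
echo $links->length, " link(s) found\n";
```

Compared to @, this keeps the warnings out of the output but still lets you log or check them if the result looks wrong.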

Otherwise you'd have to fix the HTML somehow. Just running it through htmlentities as mentioned above will not work since it'll also escape all tag markers etc.

What might work is replacing every & with &amp;. Done naively, though, this would turn an existing &amp; into &amp;amp;, so you'd have to replace only those & characters that aren't already the start of an entity such as &amp;.
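A rough sketch of that idea, using a regular expression with a negative lookahead. The function name escapeBareAmpersands and the sample string are made up for illustration. It treats anything shaped like &name;, &#123; or &#x1F; as an existing entity and leaves it alone, and it doesn't know about script blocks or other places where an & might be legitimate, so treat it as a starting point rather than a complete fix.

```php
<?php
// Hypothetical helper: escape only those "&" that do not already start an entity.
function escapeBareAmpersands(string $html): string
{
    // The negative lookahead skips named (&amp;), decimal (&#38;) and hex (&#x26;) references.
    return preg_replace(
        '/&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)/',
        '&amp;',
        $html
    );
}

// Made-up sample: one bare "&" in the URL, the rest are already valid entities.
$sample = '<a href="showthread.php?t=1&page=2">Tom &amp; Jerry &#38; co &lt;3</a>';
echo escapeBareAmpersands($sample), "\n";
```

You would run the fetched HTML through this before passing it to loadHTML(), instead of running it through htmlentities.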

David Triebe
  • Why isn't it valid HTML if it's a website, though? And does only & cause problems? – George Irimiciuc Jan 13 '15 at 15:36
  • HTML is a standard with certain rules and that website doesn't follow the rules (see http://validator.w3.org/check?verbose=1&uri=http%3A%2F%2Fforums.heroesofnewerth.com%2Fshowthread.php%3F553261 for what's wrong). HTML parsers are basically built to work around wrong HTML by guessing. That's why the website still works. – David Triebe Jan 13 '15 at 15:39
  • Added a bit of information about why & is special to the answer. – David Triebe Jan 13 '15 at 15:45