133

I've been trying to parse HTML5 code so I can set attributes and values within it, but it seems DOMDocument (PHP 5.3) doesn't support tags like <nav> and <section>.

Is there any way to parse this as HTML in PHP and manipulate the code?


Code to reproduce:

<?php
$dom = new DOMDocument();
$dom->loadHTML("<!DOCTYPE HTML>
<html><head><title>test</title></head>
<body>
<nav>
  <ul>
    <li>first
    <li>second
  </ul>
</nav>
<section>
  ...
</section>
</body>
</html>");

Error:

Warning: DOMDocument::loadHTML(): Tag nav invalid in Entity, line: 4 in /home/wbkrnl/public_html/new-mvc/1.php on line 17

Warning: DOMDocument::loadHTML(): Tag section invalid in Entity, line: 10 in /home/wbkrnl/public_html/new-mvc/1.php on line 17

Klaas S.
  • Oops, for me `loadHTML($HTML5)` returns FALSE (failure)! I need to change the new tags to DIVs... It is not only a problem of "warnings" on my screen. – Peter Krauss Feb 03 '14 at 21:22
  • 3
    This issue had been reported for PHP at https://bugs.php.net/bug.php?id=60021 which in turn spawned a feature request in the underlying libxml2: https://bugzilla.gnome.org/show_bug.cgi?id=761534 – cweiske Feb 04 '16 at 07:57

6 Answers

236

No, there is no way of specifying a particular doctype to use, or to modify the requirements of the existing one.

Your best workable solution is to suppress the warnings by switching libxml to internal error handling with libxml_use_internal_errors:

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML('...');
libxml_clear_errors();
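As one of the comments on this answer suggests, it is probably safer to restore the previous libxml error mode afterwards. The following is a minimal sketch of that idea, not part of the original answer: the helper name `loadHtml5` is hypothetical, and it assumes the installed libxml2 keeps unknown elements such as `<nav>` in the tree, which some commenters report is not always the case.

```php
<?php
// Hypothetical helper (not a PHP built-in): load HTML while collecting
// libxml errors internally, then restore the previous error mode.
function loadHtml5(string $html): DOMDocument
{
    $dom = new DOMDocument();
    // libxml_use_internal_errors() returns the previous setting.
    $previous = libxml_use_internal_errors(true);
    try {
        $dom->loadHTML($html);
    } finally {
        libxml_clear_errors();                  // discard the collected errors
        libxml_use_internal_errors($previous);  // put global state back
    }
    return $dom;
}

$dom = loadHtml5(
    '<!DOCTYPE html><html><body><nav><ul><li>first<li>second</ul></nav></body></html>'
);

// Manipulate the HTML5 element like any other node.
$nav = $dom->getElementsByTagName('nav')->item(0);
$nav->setAttribute('class', 'main-menu');
echo $dom->saveHTML($nav), "\n";
```

Wrapping the call this way keeps the change to global libxml state confined to the helper.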
rap-2-h
lonesomeday
  • 2
    Oops, for me `loadHTML($HTML5)` returns FALSE (failure)! I need to change the new tags to DIVs... – Peter Krauss Feb 03 '14 at 21:22
  • 41
    Any reason __*php7*__'s built-into DOM parser _still_ can't handle HTML5? It's been 6 years since this answer was submitted. – Super Cat Jul 01 '17 at 21:44
  • 2
    @SuperCat It's all dependent on the underlying libxml library. – lonesomeday Jul 02 '17 at 12:59
  • 9
    --- not to mention HTML5 isn't XML, never was, has been, nor will be... – Kevin_Kinsey Apr 09 '18 at 18:17
  • 3
    **Update 2019**: The warning is still fired; however, `loadHTML` now actually accepts HTML5 tags. –  Aug 17 '19 at 14:23
  • 1
    @user10351292 since which version? Can't find a changelog for libxml2. As of libxml2 version 2.9.10, still have to use `libxml_use_internal_errors(true)` to avoid having it throw an exception on HTML5 tags. (Found one more reason to hate PHP.) – Fabien Snauwaert Feb 04 '21 at 12:58
  • 2
    **2022**: HTML 5 is *still* not supported (tested with PHP 8.2 on Windows). However, has no one mentioned the `LIBXML_NOERROR` option? – vee Dec 18 '22 at 03:39
  • 1
    This is a great solution, @vee. Thank you very much. I used `$DOMDocument -> loadHTML($HTML_String, LIBXML_NOERROR);` on an HTML String which contained `
    ` and the Warning disappeared.
    – Rounin Jan 29 '23 at 00:52
  • It's probably best practice to restore original state as best one can. `$original = libxml_use_internal_errors(TRUE);`, do the stuff. `libxml_clear_errors();` `libxml_use_internal_errors($original);` Working on a non-trivial project and having a method modify global state in an undocumented manner doesn't seem like a good idea. – Luke A. Leber Jan 31 '23 at 14:28
24

You could also do

@$dom->loadHTML($htmlString);
Christian
Ilker Mutlu
  • 25
    Error suppression is not a proper way of dealing with this issue. – Klaas S. Sep 12 '14 at 09:55
  • 13
    @KlaasSangers Until we have a non-crippled DOM implementation, I'm afraid it is (either through `@` or `libxml_*`) – Dan Lugg Sep 18 '14 at 20:56
  • 10
    Yeah, in this specific case, error suppression is the best solution, in my opinion, unless you know that the HTML you will be loading is supposed to be 100% valid HTML per PHP's definition, which in my experience is never the case. – hanshenrik Feb 21 '15 at 08:38
  • 1
    @KlaasSangers...why not? – Nick Manning Apr 21 '15 at 08:19
  • 4
    PHP 8: "The @ operator no longer silences fatal errors. It's possible that this change might reveal errors that again were hidden before PHP 8. Make sure to set display_errors=Off on your production servers!" https://stitcher.io/blog/new-in-php-8 – marcus Feb 27 '20 at 18:50
9

You can filter the errors you get from the parser. As per other answers here, turn off error reporting to the screen, and then iterate through the errors and only show the ones you want:

libxml_use_internal_errors(TRUE);
// Do your load here
$errors = libxml_get_errors();

foreach ($errors as $error)
{
    /* @var $error LibXMLError */
}

Here is a print_r() of a single error:

LibXMLError Object
(
    [level] => 2
    [code] => 801
    [column] => 17
    [message] => Tag section invalid

    [file] => 
    [line] => 39
)

By matching on the message and/or the code, these can be filtered out quite easily.
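
The filtering step could look roughly like this. It is a sketch only: it assumes code 801 (as shown in the dump above) is what libxml reports for unknown tags, and the variable names are illustrative.

```php
<?php
$html = '<!DOCTYPE html><html><body><section><p>text</p></section></body></html>';

// Collect libxml errors instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// Keep every error except the "Tag ... invalid" ones (assumed code 801).
$remaining = [];
foreach (libxml_get_errors() as $error) {
    if ($error->code === 801) {
        continue;
    }
    $remaining[] = $error;
}

libxml_clear_errors();
libxml_use_internal_errors($previous);

// Report only what is left, e.g. genuinely malformed markup.
foreach ($remaining as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}
```

Matching on the numeric code is likely to be more robust than matching on the message text, which includes the tag name.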

halfer
5

There doesn't seem to be a way to suppress warnings but not errors. PHP has constants that are supposed to do this, but they don't seem to work. Here is what SHOULD work, but doesn't (possibly a bug):

 $doc=new DOMDocument();
 $doc->loadHTML("<tagthatdoesnotexist><h1>Hi</h1></tagthatdoesnotexist>", LIBXML_NOWARNING );
 echo $doc->saveHTML();

http://php.net/manual/en/libxml.constants.php
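
One possible explanation, offered as an assumption rather than a confirmed diagnosis: the error dump in another answer here shows `level => 2` for "Tag section invalid", which corresponds to `LIBXML_ERR_ERROR` rather than a warning, so `LIBXML_NOERROR` may be the flag that matters. A sketch combining both flags, assuming a PHP/libxml2 build that honours them:

```php
<?php
$doc = new DOMDocument();

// Ask libxml not to report errors or warnings for this load only.
// Whether the flags are honoured depends on the PHP/libxml2 version.
$doc->loadHTML(
    '<!DOCTYPE html><html><body><section><h1>Hi</h1></section></body></html>',
    LIBXML_NOERROR | LIBXML_NOWARNING
);

echo $doc->saveHTML();
```

Unlike `libxml_use_internal_errors()`, this does not change global state, but it also gives no way to inspect the suppressed errors afterwards.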

user2782001
  • According to this post https://stackoverflow.com/a/41845049/937477 that bug has been fixed – mmmmm Oct 09 '17 at 09:34
  • 1
    Just to be pedantic, that is not valid HTML5. Custom elements have to have a hyphen in them according to the spec http://w3c.github.io/webcomponents/spec/custom/#dfn-custom-element-type – Greg Sep 30 '19 at 10:45
  • @Greg Good to know. It's just a test to demonstrate the xml parser will recognize the tag is not valid, but ignore it because of the flag. – user2782001 Oct 03 '19 at 01:21
-3

This worked for me:

$html = file_get_contents($url);

$search = array("<header>", "</header>", "<nav>", "</nav>", "<section>", "</section>");
$replace = array("<div>", "</div>","<div>", "</div>", "<div>", "</div>");
$html = str_replace($search, $replace, $html);

$dom = new DOMDocument();
$dom->loadHTML($html);

If you need to find the header later, replace the header tag with a div tag that carries an id. For instance:

$search = array("<header>", "</header>");
$replace = array("<div id='header1'>", "</div>");

It's not the best solution but depending on the situation it can be useful.

Good luck.

Emiliano Sangoi
-9

HTML5 tags almost always use attributes such as id, class and so on. So the code for replacing will be:

$html = file_get_contents($url);
$search = array(
    "<header", "</header>", 
    "<nav", "</nav>", 
    "<section", "</section>",
    "<article", "</article>",
    "<footer", "</footer>",
    "<aside", "</aside>",
    "<noindex", "</noindex>",
);
$replace = array(
    "<div", "</div>",
    "<div", "</div>", 
    "<div", "</div>",
    "<div", "</div>",
    "<div", "</div>",
    "<div", "</div>",
    "<div", "</div>",
);
$html = str_replace($search, $replace, $html);
$dom = new DOMDocument();
$dom->loadHTML($html);
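
Note that a plain prefix such as `"<nav"` also matches longer tag names that merely start with it (for example a custom `<nav-item>` element). A regex-based variant can avoid that. This is a sketch only: the helper name `html5TagsToDivs` is hypothetical, and the tag list mirrors the one above minus `noindex`.

```php
<?php
// Hypothetical helper: rewrite selected HTML5 elements to <div>,
// keeping their attributes. The lookahead requires whitespace, "/" or ">"
// right after the tag name, so "<nav-item>" or "<navbar>" are left alone.
function html5TagsToDivs(string $html): string
{
    $pattern = '~<(/?)(?:header|nav|section|article|footer|aside)(?=[\s/>])~i';
    // preg_replace() returns null on a regex error; fall back to the input.
    return preg_replace($pattern, '<${1}div', $html) ?? $html;
}

$html = '<section id="a"><nav class="m">x</nav><nav-item>y</nav-item></section>';
echo html5TagsToDivs($html), "\n";
// prints: <div id="a"><div class="m">x</div><nav-item>y</nav-item></div>
```

As with the plain `str_replace` version, the original element names are lost, so this only suits cases where the semantic tags do not need to survive in the output.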
Pang