DOM parser that allows HTML5-style </ in <script> tag

Question

Update: html5lib (bottom of question) seems to get close, I just need to improve my understanding of how it's used.

I am attempting to find an HTML5-compatible DOM parser for PHP 5.3. In particular, I need to access the following HTML-like CDATA within a script tag:

<script type="text/x-jquery-tmpl" id="foo">
    <table><tr><td>${name}</td></tr></table>
</script>

Most parsers will end parsing prematurely because HTML 4.01 ends script tag parsing when it finds ETAGO (</) inside a <script> tag. However, HTML5 allows for </ before </script>. All of the parsers I have tried so far have either failed, or they are so poorly documented that I haven't figured out if they work or not.

My requirements:

Real parser, not regex hacks.
Ability to load full pages or HTML fragments.
Ability to pull script contents back out, selecting by the tag's id attribute.

Input:

<script id="foo"><td>bar</td></script>

Example of failing output (no closing </td>):

<script id="foo"><td>bar</script>

Some parsers and their results:

DOMDocument (fails)

Source:

<?php

header('Content-type: text/plain');
$d = new DOMDocument;
$d->loadHTML('<script id="foo"><td>bar</td></script>');
echo $d->saveHTML();

Output:

Warning: DOMDocument::loadHTML(): Unexpected end tag : td in Entity, line: 1 in /home/adam/public_html/2010/10/26/dom.php on line 5
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><head><script id="foo"><td>bar</script></head></html>

FluentDOM (fails)

Source:

<?php

header('Content-type: text/plain');
require_once 'FluentDOM/src/FluentDOM.php';
$html = "<html><head></head><body><script id='foo'><td></td></script></body></html>";
echo FluentDOM($html, 'text/html');

Output:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><head></head><body><script id="foo"><td></script></body></html>

phpQuery (fails)

Source:

<?php

header('Content-type: text/plain');

require_once 'phpQuery.php';

phpQuery::newDocumentHTML(<<<EOF
<script type="text/x-jquery-tmpl" id="foo">
<td>test</td>
</script>
EOF
);

echo (string)pq('#foo');

Output:

<script type="text/x-jquery-tmpl" id="foo">
<td>test
</script>

html5lib (passes)

Possibly promising. Can I get at the contents of the script#foo tag?

Source:

<?php

header('Content-type: text/plain');

include 'HTML5/Parser.php';

$html = "<!DOCTYPE html><html><head></head><body><script id='foo'><td></td></script></body></html>";
$d = HTML5_Parser::parse($html);

echo $d->saveHTML();

Output:

<html><head></head><body><script id="foo"><td></td></script></body></html>

Note: when you try to parse HTML via loadHTML, DOM based libraries will use libxml's HTML parser module. If you load your snippet above with loadXML instead, there will be no errors, but of course, the page is expected to be valid XHTML then. Also see [Best Methods to parse HTML](http://stackoverflow.com/questions/3577641/best-methods-to-parse-html/3577662#3577662) but basically all DOM based parsers will likely produce the same results here. — Gordon, Oct 27 '10 at 08:05
+1 for a good question. I wonder if it would be possible to use HTML comments or a CDATA block to delimit the code in the script tag, as one would do for Javascript? Or would that also get included in the output from the template? — Spudley, Sep 04 '11 at 06:57

score 11 · Answer 1 · answered May 24 '12 at 01:09

I had the same problem, and apparently you can hack your way through it by loading the document as XML and saving it as HTML :)

$d = new DOMDocument;
$d->loadXML('<script id="foo"><td>bar</td></script>');
echo $d->saveHTML();

But of course the markup must be error-free for loadXML to work.
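As a minimal sketch of what that failure looks like, the snippet below feeds loadXML() a fragment with an unclosed <br> and collects the parser errors through libxml_use_internal_errors() instead of letting PHP emit warnings. The input string is made up for illustration.

```php
<?php
// Collect libxml errors in a buffer instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$d = new DOMDocument;
// <br> is never closed, so this fragment is not well-formed XML.
$ok = $d->loadXML('<div><br></div>');

if (!$ok) {
    // Each entry is a LibXMLError with a message and a line number.
    foreach (libxml_get_errors() as $error) {
        echo trim($error->message), ' (line ', $error->line, ")\n";
    }
    libxml_clear_errors();
}
```

Checking the return value this way lets you fall back to another parser when the input turns out not to be well-formed.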

Alex

Also note that this will break on HTML5 elements that aren't self closing (link, img, br, etc) since those are illegal in XML. – Mike 'Pomax' Kamermans Feb 05 '13 at 21:24

score 7 · Answer 2 · answered Mar 12 '20 at 12:22

I just found something that works (in my case).

Try passing LIBXML_SCHEMA_CREATE as the options parameter of DOMDocument::loadHTML:

$dom = new DOMDocument;

libxml_use_internal_errors(true);
//$dom->loadHTML($buffer, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$dom->loadHTML($buffer, LIBXML_SCHEMA_CREATE);

Moch Zawaruddin Abdullah

This actually handled it. Are there side effects/limitations? – Samuele Diella Feb 18 '22 at 13:03

score 5 · Answer 3 · answered Oct 27 '10 at 01:52

Re: html5lib

You click on the download tab and download the PHP version of the parser.

You untar the archive in a local folder

 tar -zxvf html5lib-php-0.1.tar.gz
 x html5lib-php-0.1/
 x html5lib-php-0.1/VERSION
 x html5lib-php-0.1/docs/
 ... etc

You change directories and create a file named hello.php

cd html5lib-php-0.1
touch hello.php

You place the following PHP code in hello.php

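<?php
// Path assumed from the 0.1 archive layout; adjust it if your copy differs.
require_once 'library/HTML5/Parser.php';

$html = '<html><head></head><body>
<script type="text/x-jquery-tmpl" id="foo">
<table><tr><td>${name}</td></tr></table>
</script>
</body></html>';
$dom = HTML5_Parser::parse($html);
var_dump($dom->saveXml());
echo "\nDone\n";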

You run hello.php from the command line

php hello.php

The parser will parse the document tree, and return a DOMDocument object, which can be manipulated as any other DOMDocument object.
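One wrinkle when treating the result like any other DOMDocument: getElementById() only matches attributes that have been declared as IDs, and a document without a DTD never declares any. Below is a minimal sketch of one way around that, using DOMElement::setIdAttribute(). Here loadXML() and the sample markup stand in for the html5lib result, purely so the snippet runs on its own.

```php
<?php
// loadXML() is a stand-in for the parser output; the markup is illustrative.
$dom = new DOMDocument;
$dom->loadXML('<html><body><script id="foo">template body</script></body></html>');

// Without a DTD, "id" is an ordinary attribute, so this lookup finds nothing.
$before = $dom->getElementById('foo');

// Flag every "id" attribute as an ID-typed attribute.
foreach ($dom->getElementsByTagName('*') as $element) {
    if ($element->hasAttribute('id')) {
        $element->setIdAttribute('id', true);
    }
}

// Now the same lookup resolves to the script element.
$after = $dom->getElementById('foo');
echo $after->textContent, "\n";
```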

Alana Storm

Thanks for the pointers. How can I dig down to the contents of the script tag, searching by id? – Annika Backstrom Oct 27 '10 at 02:18
It's a standard DOMDocument object. If you're not comfortable with the DOMDocument, then call the saveXML method (as above) and create a SimpleXml object out of it. If you're not comfortable with Simple XML, you should read the manual – Alana Storm Oct 27 '10 at 04:03
Added html5lib to [Best Methods to parse HTML](http://stackoverflow.com/questions/3577641/best-methods-to-parse-html/3577662#3577662) – Gordon Oct 27 '10 at 08:06
@Alan I hit a wall (well, got mildly annoyed) when I couldn't get `$dom->getElementById()` to work on the resulting DOMDocument. I ended up working around the problem, but I'd be interested to know why it fails and if it can be made to work. – Annika Backstrom Oct 27 '10 at 12:29
Because DOMDocument is a confusing pile of over-engineered, poorly documented XML processing? For getElementById to work with DOM documents, you need a DTD that says which attribute name is an ID, or you need to explicitly mark an attribute on an element as an ID. Whenever I have a DOMDocument, I save out an XML string, feed it into SimpleXML, and then use the XPath functions to get at what I want. – Alana Storm Oct 27 '10 at 19:16
@Adam More info on why your call wasn't working. Sort of went beyond the 600 character limit :) http://alanstorm.com/domdocument_php_stop – Alana Storm Oct 27 '10 at 23:51
@Adam no problem. You might also be interested in my answer to [Simplify PHP DOM XML Parsing](http://stackoverflow.com/questions/3405117/simplify-php-dom-xml-parsing-how/3405651#3405651). Also, the id attributes in the DOM example in your blog post are not unique, so even if they were proper xml:id attributes, the XML wouldn't be valid. – Gordon Oct 28 '10 at 07:24
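A rough sketch of the XPath route for selecting the script element by id, staying with DOMXPath rather than converting to SimpleXML. Again, loadXML() and the sample markup are stand-ins so the snippet is self-contained. The local-name() test is a defensive assumption, so the query behaves the same whether or not the parser put the elements in a namespace.

```php
<?php
// loadXML() stands in for whatever parser produced the DOMDocument.
$dom = new DOMDocument;
$dom->loadXML('<html><head></head><body><script id="foo"><td>bar</td></script></body></html>');

$xpath = new DOMXPath($dom);
// Match on the attribute value directly; no ID declaration is required.
$nodes = $xpath->query('//*[local-name()="script"][@id="foo"]');

$template = '';
if ($nodes->length > 0) {
    foreach ($nodes->item(0)->childNodes as $child) {
        // Text nodes are taken verbatim; element children are re-serialized.
        $template .= ($child instanceof DOMText)
            ? $child->nodeValue
            : $dom->saveXML($child);
    }
}
echo $template, "\n";
```

If the parser keeps the script body as a single text node, the loop simply returns that text unchanged.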

score 5 · Answer 4 · answered Nov 04 '10 at 12:20

FluentDOM uses DOMDocument but suppresses loading notices and warnings. It does not have its own parser. You can add your own loaders (for example, one that uses html5lib).

ThW

score 4 · Answer 5 · answered Oct 24 '11 at 23:35

I added comment tags (`<!-- -->`) in my jQuery template blocks (CDATA blocks also failed) and DOMDocument did not touch the internal HTML.

Then, before I used the jQuery templates, I wrote a script to remove the comments.

$(function() {
    $('script[type="text/x-jquery-tmpl"]').text(function() {
        // The comment node in this context is actually a text node.
        return $.trim($(this).text()).replace(/^<!--([\s\S]*)-->$/, '$1');
    });
});

Not ideal, but I wasn't sure of a better workaround.

edited Oct 25 '11 at 02:27

answered Oct 24 '11 at 23:35

alex

I mean... I'm using <% %> tags (for underscore templating) and commenting them out doesn't work. I'd love to prevent XMLDocument from parsing the inner text/HTML of scripts – Tom Roggero Nov 23 '12 at 21:12

score 3 · Answer 6 · answered Sep 19 '13 at 03:58

I ran into this exact problem.

PHP's DOMDocument parses the HTML inside a script tag, and that can lead to a completely different DOM.

Since I didn't want to use a library other than DOMDocument, I wrote a few lines that strip the script contents. You then do whatever you need to do with DOMDocument, and afterwards put the script contents back.

Obviously, the script contents aren't available to your DOM object while they are stripped out.

With the following lines of PHP you can 'fix' this problem. Be warned that script tags nested inside script tags will cause bugs.

$scripts = array();
// this will select all script tags non-greedy. If you have a script tag in your script tag, it will cause problems.
preg_match_all("/((<script.*>)(.*))\/script>/sU", $html, $scripts);
// Make content of scripts empty
$html = str_replace($scripts[3], '', $html);

// Do DOM Document stuff here

// Put script contents back
$html = str_replace($scripts[2], $scripts[1], $html);

I hope this will help some people :-).

This is almost a good solution. However, it doesn't work when the script tags have no attributes that distinguish them from each other. — mavrosxristoforos, Nov 12 '18 at 14:46
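One possible variation that sidesteps the identical-tags problem is to swap each script body for a unique placeholder token and restore by token rather than by opening tag. This is only a sketch: the token format, variable names, and sample markup are invented for illustration, and, like the original, it still relies on a regex to find the script elements, so nested script tags remain unsupported.

```php
<?php
// Two script tags with identical (empty) attribute lists.
$html = '<div><script><td>a</td></script><script><td>b</td></script></div>';

// Replace each script body with a unique token and remember the original.
$bodies = array();
$stripped = preg_replace_callback(
    '~(<script\b[^>]*>)(.*?)(</script>)~is',
    function ($m) use (&$bodies) {
        $token = '__SCRIPT_BODY_' . count($bodies) . '__';
        $bodies[$token] = $m[2];
        return $m[1] . $token . $m[3];
    },
    $html
);

// ... run DOMDocument over $stripped here and serialize it back ...
$processed = $stripped;

// Put the original script bodies back in place of their tokens.
$restored = strtr($processed, $bodies);
echo $restored, "\n";
```

Because each token is unique, the restore step does not depend on the script tags being distinguishable from one another.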
