Load HTML containing namespaces with DOMDocument

Question

I have a problem: I want to load an HTML snippet that contains namespaced elements with DOMDocument.

<div class="something-first">
    <div class="something-child something-good another something-great">
        <my:text value="huhu">
    </div>
</div>

But I can't figure out how to preserve the namespaces. I tried loading it with loadHTML(), but HTML does not have namespaces, so they get stripped.

I tried loading it with loadXML(), but this doesn't work either because <my:text value="huhu"> is not well-formed XML.

What I need is a loadHTML() method that doesn't strip namespaces, or a loadXML() method that does not validate the markup: a combination of these two methods.

My code so far:

$html = '<div class="something-first">
    <div class="something-child something-good another something-great">
        <my:text value="huhu">
    </div>
</div>';

libxml_use_internal_errors(true);

$domDoc = new DOMDocument();
$domDoc->formatOutput = false;
$domDoc->resolveExternals = false;
$domDoc->substituteEntities = false;
$domDoc->strictErrorChecking = false;
$domDoc->validateOnParse = false;

$domDoc->loadHTML($html/*, LIBXML_NOERROR | LIBXML_NOWARNING*/);
$xpath = new DOMXPath($domDoc);
$xpath->registerNamespace ( 'my', 'http://www.example.com/' );

// -----> This results in zero nodes because the namespace gets stripped by loadHTML()
$nodes = $xpath->query('//my:*');
var_dump($nodes);

Is there a way to achieve what I want? I would be very grateful for any advice.

EDIT: I opened an enhancement request for libxml2 to provide an option to preserve namespaces in HTML: https://bugzilla.gnome.org/show_bug.cgi?id=711670

Loading something that is neither valid XML nor valid HTML is always going to be tricky when using `loadXML` or `loadHTML`... – lonesomeday Nov 08 '13 at 09:52
Is it possible to declare the namespace? Something like `……`. DOMDocument should be able to handle namespaces when loaded through loadXML() or load(). – jazZRo Nov 08 '13 at 10:09
Have deleted my answer as it doesn't fit your needs. But maybe it's, sad but true, simply not working. Definitely an interesting question.. +1 – hek2mgl Nov 08 '13 at 10:09
@jazZRo Yeah, that's what I was asking myself too.. But when parsing just snippets of HTML like a `` then it is common that the namespace declaration isn't available in that snippet – hek2mgl Nov 08 '13 at 10:10
@hek2mgl Until now I have tokenized the snippets with my own regex. But I tried to give PHP's built-in parser a shot because everyone says parsing HTML with your own regex isn't a good solution. But as I see it, I probably have to stay with my regex solution. – TiMESPLiNTER Nov 08 '13 at 10:12
I really don't like leaving you alone with an HTML regex parsing solution... (haven't given up ;) ) – hek2mgl Nov 08 '13 at 10:15
@hek2mgl Thanks a lot. I mean, having namespaces in HTML documents is pretty common with all the `` and `` custom tags. So why doesn't `loadHTML()` want to parse them? – TiMESPLiNTER Nov 08 '13 at 10:20
nvm. I see why it's not valid. Why not add the slash to make it valid? – Anthony Nov 08 '13 at 17:16
@TiMESPLiNTER Did some research. What we are talking about is called `(X)FBML`. (yes really :) .. Look here http://nerdramblings.tumblr.com/post/3213578636/html5-and-facebooks-fbml – hek2mgl Nov 09 '13 at 09:17

hek2mgl · Answer 1 · 2013-11-08T13:54:11.710

2

First, namespaces are allowed in XML (or XHTML) only. HTML does not support namespaces.

Given that it is XHTML and the xmlns declaration is present in the snippet, then you can access elements by namespace using DOMDocument::getElementsByTagNameNS():

$html = <<<EOF
<div xmlns:my="http://www.example.com/" class="something-first">
    <div class="something-child something-good another something-great">
        <my:text value="huhu" />
    </div>
</div>
EOF;

$domDoc = new DOMDocument();
$domDoc->loadXML($html);
var_dump(
  // it is possible to use wildcard `*` here
  $domDoc->getElementsByTagNameNS('http://www.example.com/', '*')
);

However, since the namespace declaration is commonly defined on the root element <html> rather than on sub nodes, the code above will not work in most cases.

So part two of the solution would be to check whether the declaration is present and, if not, inject it... (working on this)

As I said, the code above works for XML / XHTML only. It is still open how to do that with HTML. (check the discussion below)
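
The injection step mentioned above could be sketched roughly as follows. This is only a sketch under assumptions: the helper name `loadSnippetWithNamespace` and the wrapper element `root` are hypothetical, and the snippet must already be well-formed XML apart from the missing `xmlns` declaration. It does not help with unclosed tags.

```php
<?php
/**
 * Hypothetical helper: wraps a snippet in a root element that declares
 * the namespace prefix, then parses the result as XML.
 * Assumes the snippet is well-formed XML apart from the missing xmlns.
 */
function loadSnippetWithNamespace(string $snippet, string $prefix, string $uri): ?DOMDocument
{
    $wrapped = sprintf(
        '<root xmlns:%s="%s">%s</root>',
        $prefix,
        htmlspecialchars($uri, ENT_QUOTES),
        $snippet
    );

    // Collect parse errors internally instead of emitting warnings.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $ok = $doc->loadXML($wrapped);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $ok ? $doc : null;
}

$snippet = '<div class="something-first"><my:text value="huhu" /></div>';
$doc = loadSnippetWithNamespace($snippet, 'my', 'http://www.example.com/');

// The prefix registered for XPath must map to the same URI as in the document.
$xpath = new DOMXPath($doc);
$xpath->registerNamespace('my', 'http://www.example.com/');
$nodes = $xpath->query('//my:*');
echo $nodes->length, "\n";
```

Because the declaration sits on the wrapper, the namespaced elements survive parsing and `//my:*` can find them. The helper returns `null` when the snippet is not well formed, which is the case for the unclosed `<my:text value="huhu">` from the question.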

edited Nov 08 '13 at 13:54

answered Nov 08 '13 at 09:53

hek2mgl


This won't work because the namespace gets stripped during the parsing of my HTML snippet with `loadHTML()`. – TiMESPLiNTER Nov 08 '13 at 09:53
Yeah you are right. You can only select the `text` nodes.. (seems so, let me dig more into this) – hek2mgl Nov 08 '13 at 09:55
I want to access all elements with namespace `my`. So accessing the elements with `//text` unfortunately isn't an option either :-(. Would be great if you find a way to achieve what I want :-). – TiMESPLiNTER Nov 08 '13 at 09:57
I'm searching for a way – hek2mgl Nov 08 '13 at 09:59
So far so good. I got that far earlier today too. The problem is, you have to put in valid XML. So if your snippet is missing a closing `` or something like this, `loadXML()` will fail. – TiMESPLiNTER Nov 08 '13 at 10:38
Of course DOM methods will work with well-formed documents only. If you are working with documents that are not well formed, then you can't even say that you are working with `HTML` or `XML`. Then it's just text. But I think this is off topic here. – hek2mgl Nov 08 '13 at 10:41
Well, but it's valid to write `` and `loadXML()` will fail to load this. But `loadHTML()` can handle this. – TiMESPLiNTER Nov 08 '13 at 10:43
OK, I see now. I thought that the snippet above would work with both `xml` and `html`... – hek2mgl Nov 08 '13 at 10:47
No, it doesn't. That's exactly the problem I mentioned in my question ;-). And I think there is no other possibility, so working with my own regex tokenizer is the best solution. But I'm thinking about reporting an "issue" to the libxml project. – TiMESPLiNTER Nov 08 '13 at 10:48
Might be. Just a side note: PHP uses libxml2 internally, meaning that if it isn't possible with PHP then it won't be possible with most *nix XML tools.. :| – hek2mgl Nov 08 '13 at 10:50
But PHP uses `libxml2` for both `loadXML()` and `loadHTML()`; `loadHTML()` accepts invalid XML and `loadXML()` accepts namespaces... so it seems libxml2 provides everything needed, but a method in `DOMDocument` that combines both is missing. – TiMESPLiNTER Nov 08 '13 at 10:57
If I can help you with anything, let me know :-). – TiMESPLiNTER Nov 08 '13 at 11:00
@TiMESPLiNTER I have dug through the [source code](https://github.com/php/php-src/blob/master/ext/dom/document.c#L2203). It does not look like PHP is doing much here. (`htmlParseDocument()` is a native libxml method) – hek2mgl Nov 08 '13 at 12:09
Thanks for your effort. Okay, then I have to make a feature request for libxml2. But where? – TiMESPLiNTER Nov 08 '13 at 12:11
@TiMESPLiNTER Wait.. I think the answer is simple: namespaces are allowed in XML (XHTML) only. http://stackoverflow.com/questions/7824705/does-html5-support-namespaces .. That's it.. (Thanks to those guys who threw away XHTML and *invented* HTML5... never liked that. But yes, using HTML5, lazy web developers may omit the `` again.. Hooray!) – hek2mgl Nov 08 '13 at 12:29
Okay *grrmmml*, I opened a bug report anyway; I think it will get closed soon: https://bugzilla.gnome.org/show_bug.cgi?id=711670 But why not provide an option to allow it? libxml2 has to understand HTML5 someday anyway. – TiMESPLiNTER Nov 08 '13 at 12:43
As I said, HTML5 does not support namespaces. XHTML was great, but they *invented* HTML5. I still ask myself *why?* – hek2mgl Nov 08 '13 at 13:35
Well, some things are great in HTML5. An image tag, for example, would always be `` because it cannot contain any further markup, so why not write ``? The question is: why are no namespaces allowed in HTML5? – TiMESPLiNTER Nov 08 '13 at 13:38
Why not write `` if this makes life easier (as you see)? However, I've read the Wikipedia page on HTML5 again and it seems like `xhtml` is still alive, although it has rarely been used since HTML5 came up. If I were the guy who had to decide the future of HTML, I would throw pure HTML away and use XHTML only. – hek2mgl Nov 08 '13 at 13:42
After this problem I faced today, me too. But I can not control which HTML I have to parse. So there should be at least a possbility to preserve namespaces in libxml2 which can be set to `true` if this behavior is wished and all problems are gone. – TiMESPLiNTER Nov 08 '13 at 13:50
`But I can not control which HTML I have to parse` yeah, of course. Although we could find some arguments (for whatever) your question is still hot and open.. – hek2mgl Nov 08 '13 at 13:52
maybe you could post a link to the question in the PHP chat room. Would be interesting to hear what others say.. – hek2mgl Nov 08 '13 at 13:53

score 2 · Answer 2 · answered Nov 08 '13 at 17:15

2

Technically it's neither valid XML nor HTML (or XHTML), because HTML does not allow namespaced elements, while valid XML requires that empty elements be self-closing and that the namespace be declared. So you're basically asking "how can I have DOMDocument treat this invalid HTML as valid XML even though it's not valid XML either?", which is going to prove difficult, and one might ask why libxml should be updated to allow this. If I update your snippet to:

$html = <<<XML
<div xmlns:my="http://www.example.com/" class="something-first">
    <div class="something-child something-good another something-great">
        <my:text value="huhu" />
    </div>
</div>
XML;

adding the namespace declaration and closing the my:text element, it works just fine with:

$domDoc = new DOMDocument();
$domDoc->loadXML($html);
echo $domDoc->saveXML();

Notice that the namespace is not stripped out here. In your original snippet the namespace is stripped out, as I understand it, because it's not valid XML or HTML. The XPath can't query by the namespace since the namespace wasn't defined via xmlns and therefore was dropped.

So I guess the question is: Why are you petitioning for invalid XML support rather than adding that closing slash? Is it because the data is from an external source or because in some context the empty non-closing tag is valid?
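
If the markup comes from an external source and can't be fixed there, one option is an intermediate step that adds the closing slash before parsing. The following is only a rough sketch: the function name `selfClosePrefixedTags` and the wrapper element `root` are hypothetical, it assumes every prefixed element is empty and that attribute values contain no `<` or `>`, and the rest of the snippet still has to be well-formed XML for `loadXML()` to accept it.

```php
<?php
/**
 * Hypothetical pre-processing step: turns unclosed prefixed tags such as
 * <my:text value="huhu"> into self-closing ones (<my:text value="huhu" />).
 * Assumption: every prefixed element in the input is empty. A prefixed
 * element with a real closing tag would be broken by this rewrite.
 */
function selfClosePrefixedTags(string $html): string
{
    $result = preg_replace(
        // prefix:name, optional attributes, and a ">" not preceded by "/"
        '~<([A-Za-z][\w.-]*:[A-Za-z][\w.-]*)((?:\s[^<>]*?)?)\s*(?<!/)>~',
        '<$1$2 />',
        $html
    );

    // preg_replace() returns null on a regex error; fall back to the input.
    return $result ?? $html;
}

$html = '<div class="something-first">
    <div class="something-child something-good another something-great">
        <my:text value="huhu">
    </div>
</div>';

// Declare the prefix on a wrapper element so that loadXML() accepts it.
$xml = '<root xmlns:my="http://www.example.com/">'
    . selfClosePrefixedTags($html)
    . '</root>';

$domDoc = new DOMDocument();
$domDoc->loadXML($xml);
$nodes = $domDoc->getElementsByTagNameNS('http://www.example.com/', '*');
echo $nodes->length, "\n";
```

This is still regex-based rewriting of markup, so it shares the usual fragility of that approach; it only narrows the regex to a single, well-defined repair before handing the result to the XML parser.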

answered Nov 08 '13 at 17:15

Anthony


Nice to see another opinion here.. Unfortunately it is the same as mine... (what else should you say, I think it is just like you and I said.) .. However, the `` elements are a pain in the ass! Do you really think Facebook writes invalid HTML .... (just a question)?.. maybe we should ask them... – hek2mgl Nov 08 '13 at 20:26
Might be that you didn't see that the behaviour is different when using `loadHTML` and `loadXML` (like me before)... I think this is a reasonable question as it is a real world problem.. (OP hasn't designed the HTML. it could be anything) – hek2mgl Nov 08 '13 at 20:30
My guess is that Facebook serves valid XHTML, though I can't say for sure, since I don't ever interact with Facebook. If the `xmlns` for the `fb` namespace is provided, then it's valid. It's one thing for HTML to be malformed, but XML is generally more strictly parsed, and with namespaces having `xmlns` is a requirement, not just a best practice. Chrome won't display the original snippet, so why should a less forgiving lexer? – Anthony Nov 08 '13 at 20:59
@hek2mgl - ignore last comment. fat fingers on a touch screen. Something I find really interesting is the ongoing pushback against closing empty elements. There are probably 100+ questions related to the goal of not enforcing this rule, not to mention tons of back-and-forth in the HTML spec on whether to enforce this, but to me it always made sense that if you have a tag that could be interpreted as an opening tag but does not have a closing tag (like ``) it should have some polite indicator (like ``) informing the parser that there's no end tag coming. – Anthony Nov 08 '13 at 21:07
Yeah. I cannot understand this discussion (`` or ``).. It's `` .. That's it! *point*! :) However we need to discuss these `` elements (if you like, of course).. Because they aren't just served by Facebook; they are included in several (millions?) of other (HTML) sites.. I'm tired for today.. but would really like to find the final answer here. (for that reason I would even deal with the devil and create a fb account (if necessary)) :) – hek2mgl Nov 08 '13 at 22:12
Did some research. What we are talking about is called `(X)FBML`. (yes really :) .. Look here http://nerdramblings.tumblr.com/post/3213578636/html5-and-facebooks-fbml – hek2mgl Nov 09 '13 at 09:18
(X)FBML, what the heck ;-)? We all know that it should be `` instead of ``. The fact is that a lot of HTML markup doesn't close ``, ``, ``, etc. tags. That's not our fault, but we have to live (and deal) with it because it's common (bad) practice in HTML5. E.g. if you use `loadHTML()` for `` and afterwards `saveHTML()`, it becomes ``, so libxml2 recognizes that the `img` tag is self-closing. But not `fb:like`, because it's a custom creation... – TiMESPLiNTER Nov 11 '13 at 07:28
Yes, but the HTML that doesn't close its tags is valid HTML (even if it's a bad habit) but not valid XML and thus has no business using namespaces. You can't parse invalid XML as poorly formed HTML, or parse poorly formed HTML and expect XML syntax to also pass through (or you can, apparently, but not with namespaces being retained). I would consider some intermediate function that closes these tags. – Anthony Nov 11 '13 at 15:09
