
I am parsing a website's HTML and there is a 'table' inside an 'a':

<?php 

$dom = new DOMDocument;

$dom->loadHTML("<!DOCTYPE html>
<html>
  <head></head>
  <body>
    <a>
      <table><tr><td></td></tr></table>
    </a>
  </body>
</html>");

if ($dom->getElementsByTagName("table")->item(0)->parentNode->nodeName == "body")
  echo "Why is table a child of 'body'? It should be a child of 'a'.";

I also get this warning:

PHP Warning:  DOMDocument::loadHTML(): Unexpected end tag : a in Entity, line: ...

I am using PHP 7.4.

I know 'table's are not officially allowed inside 'a's. BUT:

  1. The warning is about something else entirely (an unexpected end tag), not about the nesting.
  2. Moving the 'table' up to 'body' just because I put it inside an 'a' does not make sense.

What can I do? At the very least, the table should not end up as a child of 'body', because as it stands I cannot parse sites properly.

Edit: Please read the comments under this question. In HTML5, tables are allowed inside 'a's in this case, which makes this behavior even stranger.

zomega

1 Answer


When loading HTML content, DOMDocument "fixes" your document. You can see by printing the parsed HTML that the <table> has been moved outside the <a>:

echo $dom->saveHTML();

Output (not formatted):

<!DOCTYPE html>
<html><head></head><body>
    <a>
        </a><table><tr><td></td></tr></table></body></html>

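You can try loading your document as XML instead. Note that this only works if the markup is well-formed XML (every tag closed, attributes quoted), which is the case for your example:

$dom->loadXML('your HTML string');

Calling $dom->saveXML() shows the document structure has not changed.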
You get the correct parent node when fetching the <table>:

echo $dom->getElementsByTagName("table")->item(0)->parentNode->nodeName;
// Output: a
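
Since real-world pages are often not well-formed XML, it may help to check whether loadXML() succeeded before relying on the result. The following is a minimal sketch, not a definitive implementation; the helper name firstTableParent() is hypothetical and only serves this illustration:

```php
<?php
// Hypothetical helper: returns the node name of the first <table>'s parent,
// or null if the markup is not well-formed XML or contains no <table>.
function firstTableParent(string $markup): ?string
{
    // Collect libxml errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $loaded = $dom->loadXML($markup);

    // Restore the previous error handling state.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded) {
        return null; // the XML parser rejected the markup
    }

    $table = $dom->getElementsByTagName('table')->item(0);
    return $table === null ? null : $table->parentNode->nodeName;
}

$markup = '<!DOCTYPE html><html><head></head><body>'
        . '<a><table><tr><td></td></tr></table></a>'
        . '</body></html>';

var_dump(firstTableParent($markup));        // well-formed: the parent is the <a>
var_dump(firstTableParent('<a><br></a>'));  // unclosed <br>: not well-formed, so null
```

When the helper returns null, you would have to fall back to loadHTML() and accept its restructuring, or clean up the markup first.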

As for the warning you got, this is only a guess, since I don't know how the parser works internally:

The parser sees the opening <a> and then the opening <table>:

...<body><a><table>

Since it considers a <table> inside an <a> to be "wrong", it closes the <a> before the opening <table>:

...<body><a></a><table>

Later in the document, it finds your original closing </a> on its own, with no matching open tag left, which triggers the "Unexpected end tag" warning:

...<body><a></a><table>...</table></a>
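
To look at what the parser actually reports, without the warnings cluttering the output, the messages can be collected through libxml's internal error buffer. This is a small sketch under the assumption that the same markup as in the question is used; the exact messages and line numbers depend on the installed libxml2 version:

```php
<?php
$html = '<!DOCTYPE html><html><head></head><body>'
      . '<a><table><tr><td></td></tr></table></a>'
      . '</body></html>';

// Buffer parser errors instead of raising PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// Each entry is a LibXMLError object with level, code, line and message.
foreach (libxml_get_errors() as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}

// Empty the buffer and restore the previous error handling state.
libxml_clear_errors();
libxml_use_internal_errors($previous);
```

This only silences and exposes the warnings; it does not change how loadHTML() rearranges the tree.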
AymDev
  • I would mark your answer as accepted. But please read the comments under my question. Tables are allowed inside 'a's. – zomega Dec 03 '22 at 07:53
  • @zomega okay then I guess your question could be a duplicate of [this one](https://stackoverflow.com/questions/10712503/how-to-make-html5-work-with-domdocument): DOMDocument doesn't follow HTML5 specs. I'm afraid I can't improve my answer that much, that's the only workaround I found. – AymDev Dec 03 '22 at 14:19