
I have been trying to parse web pages with PHP's DOMDocument so that an application can scan them for SEO quality.

However, I have run into a problem. For testing purposes I wrote a small HTML page containing the following incorrect HTML:

<head>
<meta name="description" content="randomdesciption">
</head>
<title>sometitle</title>

As you can see, the title is outside the head tag, which is the error I am trying to detect.

Now comes the problem: when I use cURL to fetch the response string from this page and pass it to DOMDocument to load as HTML, it fixes the markup by adding another pair of <head> and </head> tags around the title.

<head>
<meta name="description" content="randomdesciption">
</head>
<head><title>sometitle</title></head>

I have checked the cURL response data and that is not the problem; the PHP DOMDocument fixes the HTML syntax during the call to loadHTML().

I have also tried setting the DOMDocument recover, substituteEntities and validateOnParse properties to false, without success.

I have been searching Google but have not found any answers so far. I guess it is rare for someone to actually want broken HTML left unfixed.

Does anyone know how to prevent DOMDocument from fixing my broken HTML?

Syscall
Björn
  • Have you considered running your markup through [tidy](http://php.net/tidy) before passing it to DOM, or even in lieu of DOM? It's a useful extension for detecting markup errors. – TML Jan 17 '12 at 16:26
  • Note: This behaviour is actually as specified in HTML: `<head>` has an optional opening and closing tag and is implied by the presence of a head-only element like `<title>`, meaning that a `<title>` outside the head will be parsed as being within a `<head>` element with its opening tag omitted. Once read into memory, the DOM doesn't preserve which optional tags were present in the source, as that is not part of the semantics of the document, so they are always output as present. Using HTML_PARSE_NO_IMPLIED can have side effects on how some valid HTML documents are interpreted. – thomasrutter Dec 20 '17 at 03:37
  • Possible duplicate of [How to saveHTML of DOMDocument without HTML wrapper?](https://stackoverflow.com/questions/4879946/how-to-savehtml-of-domdocument-without-html-wrapper) – miken32 May 16 '19 at 03:30

1 Answer

UPDATE: as of PHP 5.4 (with libxml 2.7.7 or later) you can pass the LIBXML_HTML_NOIMPLIED option, which sets libxml's HTML_PARSE_NOIMPLIED flag:

$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED);
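
To see what the option does with a given page, one way is to load the markup with it and look at which element the parser placed the title under. The following is only a rough sketch: the helper name `titleParentName` is made up for illustration, and the result for broken markup is not guaranteed, since it can differ between libxml versions.

```php
<?php
// Hypothetical helper (not part of PHP): returns the node name of the
// element that contains the first <title>, or null if no <title> was parsed.
function titleParentName(string $html): ?string
{
    // Collect parser warnings internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    // LIBXML_HTML_NOIMPLIED asks libxml not to add implied html/body elements.
    $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $title = $dom->getElementsByTagName('title')->item(0);
    if ($title === null || $title->parentNode === null) {
        return null;
    }
    return $title->parentNode->nodeName;
}
```

Calling `titleParentName($html)` on the test page and comparing the result against `'head'` shows whether the parser still moved the title; it is worth checking this on the libxml version actually deployed.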

Original answer below

You can't. In theory, libxml has a flag, HTML_PARSE_NOIMPLIED, that prevents it from adding implied markup, but it's not accessible from PHP.

On a side note, this particular behavior seems to depend on the LIBXML_VERSION in use.

Running this snippet:

<?php
$html = <<< HTML
<head>
<meta name="description" content="randomdesciption">
</head>
<title>sometitle</title>
HTML;

$dom = new DOMDocument;
$dom->loadHTML($html);
$dom->formatOutput = true;
echo $dom->saveHTML(), LIBXML_VERSION;

on my machine will give

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head><meta name="description" content="randomdesciption"></head>
<title>sometitle</title>
</html>
20707
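
Since the parser's output varies like this, one alternative for the specific check in the question is to inspect the raw response string before it reaches DOMDocument. This is a different technique from the DOM approach above, and only a naive sketch: the function names `firstOffset` and `titleIsOutsideHead` are hypothetical, it ignores comments, scripts and CDATA, and it assumes that a page without explicit `<head>` and `</head>` tags should be reported as well.

```php
<?php
// Hypothetical helper: byte offset of the first match of $pattern, or null.
function firstOffset(string $pattern, string $html): ?int
{
    if (preg_match($pattern, $html, $m, PREG_OFFSET_CAPTURE) === 1) {
        return $m[0][1];
    }
    return null;
}

// Hypothetical check: true if the first <title> tag in the raw source
// does not sit between an explicit <head> and </head>.
function titleIsOutsideHead(string $html): bool
{
    $title = firstOffset('/<title[\s>]/i', $html);
    if ($title === null) {
        return false; // no title at all; that would be a separate check
    }

    // [\s>] keeps <head from also matching <header.
    $headOpen  = firstOffset('/<head[\s>]/i', $html);
    $headClose = firstOffset('/<\/head\s*>/i', $html);

    // Assumption: missing explicit head tags count as "outside".
    if ($headOpen === null || $headClose === null) {
        return true;
    }
    return $title < $headOpen || $title > $headClose;
}
```

Because it works on the string returned by cURL, this check is unaffected by whatever repairs loadHTML() applies afterwards.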
Gordon
  • That's not what I had hoped for, but at least I can stop searching for something that simply isn't there. Thank you for your help; it has been most informative. – Björn Jan 17 '12 at 15:56
  • This is now available in PHP v5.4+ through the 'options' second parameter of the [loadHTML](http://php.net/manual/en/domdocument.loadhtml.php) method. – Robert Brisita Dec 03 '14 at 22:28