I am trying to parse web pages with PHP's DOMDocument so that an application can scan them for SEO quality.
However, I have run into a problem. For testing purposes I wrote a small HTML page containing the following incorrect HTML:
<head>
<meta name="description" content="randomdesciption">
</head>
<title>sometitle</title>
As you can see, the title is outside the head tag, which is the error I am trying to detect.
Here is the problem: when I use cURL to fetch the response string from this page and then pass it to DOMDocument to load as HTML, it "fixes" the markup by adding another pair of <head> and </head> tags around the title:
<head>
<meta name="description" content="randomdesciption">
</head>
<head><title>sometitle</title></head>
I have checked the cURL response data and it is not the cause: the response still contains the original markup. It is the PHP DOMDocument that corrects the HTML syntax while loadHTML() runs.
I have also tried setting the DOMDocument properties recover, substituteEntities and validateOnParse to false, without success.
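A minimal sketch of the reproduction, assuming the cURL response is replaced by a hard-coded string (the variable names $html and $doc are illustrative, not taken from my actual application):

```php
<?php
// Stand-in for the cURL response: the test page with <title> outside <head>.
$html = "<head>\n"
      . "<meta name=\"description\" content=\"randomdesciption\">\n"
      . "</head>\n"
      . "<title>sometitle</title>";

$doc = new DOMDocument();

// The three properties mentioned above, all set to false before loading.
$doc->recover = false;
$doc->substituteEntities = false;
$doc->validateOnParse = false;

// Collect libxml parser messages internally instead of raising PHP warnings.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

// Serialize the parsed tree so it can be compared with the original string.
$serialized = $doc->saveHTML();
echo $serialized;
```

Comparing $serialized with $html is how I noticed that the parsed document no longer matches what the server sent.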
I have searched Google but have not found any answers so far. I guess it is rare for someone to actually want broken HTML left unfixed.
Does anyone know how to prevent DOMDocument from fixing my broken HTML?