I want to be able to load any HTML document and edit it using PHP's DOMDocument functionality.
The problem is that some websites, for example Facebook, add XML-style namespaces to their tags:
<fb:like send="true" width="450" show_faces="true"></fb:like>
DOMDocument is very tolerant of dirty code, but it will not accept namespaces in HTML. What happens is:
- If I use loadHTML to load the code, the namespace prefixes get stripped out, but I need them to stay
- If I use loadXML to load the code, I get tons of errors stating that the document is not valid XML (a small snippet showing this is at the bottom of the post)
So my idea was to convert the HTML I fetch into XML so that I can parse it with loadXML. My question is: how do I do this, and which tool should I use? I have heard of Tidy, but I can't get it to work (my rough attempt is right below). Or would it be better to use a different parser altogether, one that can handle namespaces in HTML code?
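This is roughly what I tried with Tidy (assuming the tidy extension is installed; the config options are my best guess from the Tidy documentation, so they may well be wrong):

<?php
$html = file_get_contents($_POST['url']);

// Ask Tidy for XHTML output and try to teach it the unknown fb:like tag
$config = array(
    'output-xhtml'        => true,
    'new-blocklevel-tags' => 'fb:like',
    'wrap'                => 0,
);
$xhtml = tidy_repair_string($html, $config, 'utf8');

$domDoc = new DOMDocument();
$domDoc->loadXML($xhtml); // still either throws warnings or loses the fb: prefix for me
?>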
And here is the main code snippet, showing the loadHTML behaviour from the first bullet point:
<?php
$html = file_get_contents($_POST['url']);
$domDoc = new DOMDocument();
$domDoc->loadHTML($html);
// Edit the document in some way; it doesn't matter what. As an example, delete the head tag:
$headTag = $domDoc->getElementsByTagName("head")->item(0);
$headTagParent = $headTag->parentNode;
$headTagParent->removeChild($headTag);
echo $domDoc->saveHTML();
// This works as expected for any URL EXCEPT the ones that use XML-style namespaces
// as Facebook does above: in that case DOMDocument silently drops the namespace prefix
?>
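
And for the second bullet point, this is a minimal version of my loadXML attempt, using libxml_use_internal_errors() so the errors are collected instead of printed as warnings:

<?php
$html = file_get_contents($_POST['url']);

$domDoc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parse errors instead of emitting warnings
$loaded = $domDoc->loadXML($html);  // fails on real-world HTML: mismatched tags, undefined fb: prefix, etc.

foreach (libxml_get_errors() as $error) {
    echo trim($error->message) . "\n";
}
libxml_clear_errors();
?>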