
I want to be able to load any HTML document and edit it using PHP's DOMDocument functionality.
The problem is that some websites, for example Facebook, add XML-style namespaces to their tags.

<fb:like send="true" width="450" show_faces="true"></fb:like>

DOMDocument is very tolerant of dirty code, but it will not accept namespaces in HTML code. What happens is:

  • If I use loadHTML to load the code, the namespaces get stripped out, but I need them to stay
  • If I use loadXML to load the code, I get tons of errors stating that I'm not loading valid XML

So my idea was to convert the HTML I get into XML so I can parse it using loadXML. My question is: how do I do this, and which tool should I use? (I have heard of Tidy but I can't get it to work.) Or is it a better idea to use a different parser, one that can handle namespaces in HTML code?

Code snippet:

<?php
$html = file_get_contents($_POST['url']);

$domDoc = new DOMDocument();
$domDoc->loadHTML($html);

//Just do anything here. It doesn't matter what. For example I'm deleting the head tag
$headTag = $domDoc->getElementsByTagName("head")->item(0);
$headTagParent = $headTag->parentNode;
$headTagParent->removeChild($headTag);

echo $domDoc->saveHTML();

//This works as expected for any URL EXCEPT those that use XML namespaces, as Facebook does (see above). For such markup, DOMDocument strips the namespace prefix from the tags.

?>

Syndace
  • possible duplicate of [Convert HTML code to doc using PHP and PHPWord](http://stackoverflow.com/questions/30076922/convert-html-code-to-doc-using-php-and-phpword) – Varun Naharia May 07 '15 at 09:02
  • pls edit your question and add a minimum example of your HTML/XML whatever. – michi May 07 '15 at 09:14
  • @Varun Naharia I'm sorry, but this doesn't help me at all. That's not an answer to my question. – Syndace May 07 '15 at 12:07
  • @michi I really don't think a code example is needed here. I just want to be able to convert any HTML code to XML. Just any, nothing special. – Syndace May 07 '15 at 12:08
  • you are not converting html to doc ? – Varun Naharia May 07 '15 at 12:10
  • No :/ I'm converting html to xml to work around an issue with dirty html code – Syndace May 07 '15 at 12:14
  • A docx file contains an XML file that you can use. Try renaming the docx file to a .zip extension and opening the document.xml file. If this is the file you want, you can refer to that link; otherwise please explain the problem in detail with context. – Varun Naharia May 07 '15 at 12:24
  • First of all sorry if my question is so hard to understand. It is my first time posting here and I was expecting problems as I'm not native english. I think I will edit the whole post. Wait a few mins until I'm done editing – Syndace May 07 '15 at 12:34
  • [Related question](http://stackoverflow.com/questions/19855997/load-html-containing-namespaces-with-domdocument) about namespaced elements in HTML (which is not supported). – Ja͢ck May 07 '15 at 13:16
  • I have read that post before and it helped me understanding why my problem exists and I know that there is no "clean" solution for the problem, as the code itself is not clean. I am still looking for a workaround – Syndace May 07 '15 at 13:25
  • You can use `->loadXML()` and register the namespace ... then, you can suppress the warnings; see also [this answer](http://stackoverflow.com/questions/1148928/disable-warnings-when-loading-non-well-formed-html-by-domdocument-php/17559716#17559716) – Ja͢ck May 07 '15 at 13:29
  • I cannot use loadXML with HTML code because it will fail and return false, since HTML is obviously not XML. The first sentence in my post is "I want to be able to load any html document and edit it using php's domdocument functionality". That means I use file_get_contents() with whatever URL and then try to edit the result with DOMDocument. – Syndace May 07 '15 at 13:44
  • @Syndace a code example may clarify the question and attract fellow developer's answers – michi May 07 '15 at 16:09
  • I think you are all expecting something more complex than it actually is. I'll add a code snippet – Syndace May 07 '15 at 18:39

2 Answers


There is no clean way to parse HTML with namespaces using DOMDocument without losing the namespaces, but there are some workarounds:

  • Use another parser that accepts namespaces in HTML code. Look here for a nice and detailed list of HTML parsers. This is probably the most efficient way to do it.
  • If you want to stick with DOMDocument, you basically have to pre- and postprocess the code.

    • Before you send the code to DOMDocument->loadHTML, use regex, loops or whatever you want to find all namespaced tags, and add a custom attribute containing the namespace to each opening tag.

      <fb:like send="true" width="450" show_faces="true"></fb:like>
      

      would then result in

      <fb:like xmlNamespace="fb" send="true" width="450" show_faces="true"></fb:like>
      
    • Now give the edited code to DOMDocument->loadHTML. It will strip out the namespaces but keep the attributes, resulting in

      <like xmlNamespace="fb" send="true" width="450" show_faces="true"></like>
      
    • Now (again using regex, loops or whatever you want) find all tags with the attribute xmlNamespace, remove the attribute and put the namespace it contains back in front of the tag name. Don't forget to also add the namespace to the closing tags!
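
The pre- and postprocessing steps above could be sketched roughly as follows. This is only an illustration under some assumptions: the helper names markNamespacedTags() and restoreNamespacedTags() are made up for this example, the marker attribute is written in lowercase ("xmlnamespace") because the HTML parser lowercases attribute names, and the restore step rebuilds the elements through the DOM instead of running a second regex over the output, so closing tags are handled automatically. The regex in the first step is naive and would also match a "<name:name" sequence inside script or text content.

```php
<?php
// Sketch only. markNamespacedTags(), restoreNamespacedTags() and the lowercase
// marker attribute "xmlnamespace" are assumptions made for this example.

function markNamespacedTags(string $html): string
{
    // <fb:like ...> becomes <fb:like xmlnamespace="fb" ...>; closing tags are not touched.
    return preg_replace(
        '/<([A-Za-z][A-Za-z0-9_-]*):([A-Za-z][A-Za-z0-9_-]*)/',
        '<$1:$2 xmlnamespace="$1"',
        $html
    );
}

function restoreNamespacedTags(DOMDocument $doc): void
{
    $xpath = new DOMXPath($doc);
    // Copy the node list into an array first, because the loop replaces nodes.
    foreach (iterator_to_array($xpath->query('//*[@xmlnamespace]'), false) as $old) {
        $prefix = $old->getAttribute('xmlnamespace');
        $old->removeAttribute('xmlnamespace');
        if (strpos($old->nodeName, ':') !== false) {
            continue; // the parser kept the prefix, so only the marker had to go
        }
        // Build <prefix:name>, then move attributes and children over.
        $new = $doc->createElement($prefix . ':' . $old->nodeName);
        foreach (iterator_to_array($old->attributes, false) as $attr) {
            $new->setAttribute($attr->nodeName, $attr->nodeValue);
        }
        while ($old->firstChild !== null) {
            $new->appendChild($old->firstChild);
        }
        $old->parentNode->replaceChild($new, $old);
    }
}

$html = '<div><fb:like send="true" width="450" show_faces="true"></fb:like></div>';

libxml_use_internal_errors(true); // collect warnings about unknown tags instead of printing them
$doc = new DOMDocument();
$doc->loadHTML(markNamespacedTags($html));
libxml_clear_errors();

// ... edit the document here ...

restoreNamespacedTags($doc);
echo $doc->saveHTML();
```

The check for a colon in the node name is there because whether the prefix is actually stripped may depend on the libxml version PHP was built against; in that case only the marker attribute is removed.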

Community
Syndace

Building on Syndace's answer, here is some regex-based code that will escape out your namespaces by replacing each colon with "___" (you can choose some other escape sequence that you think is safer):

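$doc = new DOMDocument();
// Escape the prefix colon in opening and closing tags: <fb:like> becomes <fb___like>
$modifiedHtml = preg_replace('/<(\/?)([a-z]+):/i', '<$1$2___', $inputHtml);
$x = $doc->loadHTML($modifiedHtml);
// ...if desired, do stuff to your parsed html here...
// Undo the escaping on the serialized output
$outputHtml = preg_replace('/<(\/?)([a-z]+)___/i', '<$1$2:', $doc->saveHTML());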

This should work on <fb:like>, <mynamespace:mytag> or anything else you throw at it.
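
For reference, a self-contained sketch of the same round trip might look like the code below. The helper names escapeTagPrefixes() and unescapeTagPrefixes() and the sample input are assumptions made for this example. The prefix pattern here also accepts upper-case letters, digits and hyphens, but deliberately not underscores, so that the "___" escape sequence stays unambiguous. Parser warnings about unknown tags are collected with libxml_use_internal_errors() instead of being printed.

```php
<?php
// Sketch only: the helper names are hypothetical; "___" is the escape
// sequence suggested in the answer above.

function escapeTagPrefixes(string $html): string
{
    // <fb:like> and </fb:like> become <fb___like> and </fb___like>
    return preg_replace('/<(\/?)([A-Za-z][A-Za-z0-9-]*):/', '<$1$2___', $html);
}

function unescapeTagPrefixes(string $html): string
{
    // Reverse of escapeTagPrefixes(): <fb___like> becomes <fb:like> again
    return preg_replace('/<(\/?)([A-Za-z][A-Za-z0-9-]*)___/', '<$1$2:', $html);
}

$inputHtml = '<div><fb:like send="true" width="450"></fb:like></div>';

libxml_use_internal_errors(true); // collect warnings about unknown tags instead of printing them
$doc = new DOMDocument();
$doc->loadHTML(escapeTagPrefixes($inputHtml));
libxml_clear_errors();

// ... edit the document here ...

$outputHtml = unescapeTagPrefixes($doc->saveHTML());
echo $outputHtml;
```

Note that the HTML parser lowercases tag names, so a tag written as <FB:Like> in the input would come back in lowercase.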

xgretsch