
I have the following HTML with a custom tag, <gcse:search>. It comes from Google's Custom Search Engine and is meant to be embedded in a page.

However, after parsing with PHP's DOMDocument, <gcse:search> is converted to <search>, which breaks the functionality.

<?php

$html = <<<EOD
<!DOCTYPE html>
<html>
    <body>
        <gcse:search enablehistory="false"></gcse:search>
    </body>
</html>
EOD;

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
echo $dom->saveHTML();

Output:

<!DOCTYPE html>
<html>
    <body>
        <search enablehistory="false"></search>
    </body>
</html>
Gijo Varghese
  • Does this answer your question? [Load HTML containing namespaces with DOMDocument](https://stackoverflow.com/questions/19855997/load-html-containing-namespaces-with-domdocument) – CBroe Mar 09 '21 at 07:48
  • @CBroe I've already tried the answers from that post. Unfortunately, loading HTML content using `loadXML` creates a lot more issues. It's not the right way – Gijo Varghese Mar 09 '21 at 11:06
  • Apparently it can't be done using DOMDocument; you may try another library [like HTML5-php](https://github.com/Masterminds/html5-php). – Jack Fleeting Mar 09 '21 at 15:47

2 Answers


You could replace the namespace prefixes with placeholders before parsing the HTML and convert them back after the saveHTML() call.

<?php

$html = <<<EOD
<!DOCTYPE html>
<html>
    <body>
        <gcse:search enablehistory="false"></gcse:search>
        <gcse:test enablehistory="false"></gcse:test>
        
         <mynamespace:testing enablehistory="false">test</mynamespace:testing>
    </body>
</html>
EOD;

$htmlNamespaces = ['gcse:', 'mynamespace:'];

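// Build one placeholder per prefix. The trailing "__" keeps "ns__1__" from
// also matching inside "ns__10__" when there are ten or more prefixes.
$namespaceReplacements = array_map(function($index){
    return "ns__" . $index . "__";
}, array_keys($htmlNamespaces));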

$html = str_replace($htmlNamespaces, $namespaceReplacements, $html);

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$rawHtml = $dom->saveHTML();

$formattedHtml = str_replace($namespaceReplacements, $htmlNamespaces, $rawHtml);

echo $formattedHtml;

Result:

<!DOCTYPE html>
<html>
<body>
  <gcse:search enablehistory="false"></gcse:search>
  <gcse:test enablehistory="false"></gcse:test>
  <mynamespace:testing enablehistory="false">test</mynamespace:testing>
</body>
</html>
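
If the prefixes are not known in advance, the list could be collected from the markup instead of being hard-coded. The following is only a sketch: findTagPrefixes() is a hypothetical helper, not part of DOMDocument, and it assumes prefixes consist of letters, digits, and hyphens. Keep in mind that str_replace() also rewrites any matching text outside of tag names.

```php
<?php

// Hypothetical helper: list every prefix used in a tag name,
// e.g. "gcse:" for <gcse:search> or </gcse:search>.
function findTagPrefixes(string $html): array
{
    // Capture the part of a tag name that comes before the colon.
    preg_match_all('/<\/?([a-z][a-z0-9-]*):/i', $html, $matches);

    $prefixes = array_map(function ($prefix) {
        return $prefix . ':';
    }, $matches[1]);

    // Drop duplicates and reindex so the keys can be used to build placeholders.
    return array_values(array_unique($prefixes));
}

$html = '<gcse:search></gcse:search><mynamespace:testing>test</mynamespace:testing>';
$htmlNamespaces = findTagPrefixes($html);
```

The resulting array can be used in place of the hand-written $htmlNamespaces list.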
MaartenDev
  • Thanks. Unfortunately it's not possible for me to guess the namespaces. It will be used in a WordPress plugin. So there could be a lot of unknown tags. For now, I've written a regex to detect such tags, convert : to __ and then convert back. – Gijo Varghese Mar 12 '21 at 15:27

The only solution I found is to replace the : in tag names with ___ before parsing and then restore it after saveHTML().

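// Swap the colon in prefixed tag names for a placeholder the HTML parser keeps.
$html = preg_replace('/<(\/?)([a-z]+):/', '<$1$2___', $html);
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
// do stuff
// Restore the colon in the serialized markup.
$html = preg_replace('/<(\/?)([a-z]+)___/', '<$1$2:', $doc->saveHTML());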
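
The same round trip could be wrapped in a function so the masking and unmasking stay together. This is a minimal sketch under some assumptions: transformPrefixedHtml() and the data-checked attribute are made up for illustration, the placeholder ___ must not already occur in a tag name, and the regex also rewrites matching text inside scripts or comments.

```php
<?php

/**
 * Hypothetical helper (not part of DOMDocument): hides prefixed tag names
 * from the HTML parser, runs $callback on the document, then restores them.
 */
function transformPrefixedHtml(string $html, callable $callback): string
{
    // <gcse:search> and </gcse:search> become <gcse___search> and </gcse___search>.
    $masked = preg_replace('/<(\/?)([a-z][a-z0-9-]*):/i', '<$1$2___', $html);

    // Silence warnings about unknown tags, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($masked);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Inside the callback, elements are addressed by their masked names.
    $callback($doc);

    // Put the colon back in the serialized markup.
    return preg_replace('/<(\/?)([a-z][a-z0-9-]*)___/i', '<$1$2:', $doc->saveHTML());
}

$html = '<!DOCTYPE html><html><body><gcse:search enablehistory="false"></gcse:search></body></html>';

$result = transformPrefixedHtml($html, function (DOMDocument $doc) {
    foreach ($doc->getElementsByTagName('gcse___search') as $node) {
        $node->setAttribute('data-checked', '1');
    }
});

echo $result;
```

The callback is where the "do stuff" step goes; everything it sees uses the masked tag names, so lookups need gcse___search rather than gcse:search.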
Gijo Varghese