I have been working on tidying up messy HTML tags with DOM, but I have now run into a bigger problem:

$content = '<p><a href="#">this is a link</a></p>';

function tidy_html($content, $allowable_tags = null, $span_regex = null)
{
    $dom = new DOMDocument();
    $dom->loadHTML($content);

    // other code
    return $dom->saveHTML();
}

echo tidy_html($content);

It will output the entire DOM,

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"> 
<html><body><p><a href="#">this is a link</a></p></body></html> 

but I only want something like this in the return,

<p><a href="#">this is a link</a></p>

I don't want,

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"> 
    <html><body>...</body></html>

Is this possible?

EDIT:

The innerHTML simulation writes some strange codes to my database, such as &#13;, Â and ’:

<p>Monday July 5th 10am - 3.30pm £20</p>&#13;
<p>Be one of the first visitors to the ...at this special event.Â</p>&#13;
<p>All participants will receive a free copy of the ‘Contemporary Art Kit’ produced exclusively for Art on....</p>&#13;

the innerHTML simulation,

$innerHTML = '';
$nodeBody = $dom->getElementsByTagName('body')->item(0);
foreach($nodeBody->childNodes as $child) {
  $innerHTML .= $nodeBody->ownerDocument->saveXML($child);
}

I found out that the strange codes that appear at line breaks are caused by saveXML($child).

So when I have something like this,

$content = '<p><br/><a href="#">xx</a></p>
<p><br/><a href="#">xx</a></p>';

It will return something like this,

<p><a href="#">xx</a></p>&#13;
<p><a href="#">xx</a></p>

But what I actually want is this,

<p><a href="#">xx</a></p>
<p><a href="#">xx</a></p>
hakre
Run
  • Possible duplicate of [How to saveHTML of DOMDocument without HTML wrapper?](https://stackoverflow.com/questions/4879946/how-to-savehtml-of-domdocument-without-html-wrapper) – miken32 Nov 02 '18 at 02:15

2 Answers

If you're working on a fragment, you normally need only the body contents.

DOMDocument in PHP does not offer anything like innerHTML. You can simulate it, however:

$innerHTML = '';
$nodeBody = $dom->getElementsByTagName('body')->item(0);
foreach($nodeBody->childNodes as $child) {
  $innerHTML .= $nodeBody->ownerDocument->saveXML($child);
}
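
As a minimal sketch of how the simulation above could be wrapped up, the following assumes the input is UTF-8 and that PHP 5.3.6 or later is available (type declarations as written need PHP 7). The function name body_inner_html is hypothetical. Note two deliberate changes from the snippet above: it swaps in saveHTML($child) for saveXML($child), and it normalizes line breaks before parsing, which is aimed at the &#13; and mis-encoded characters described in the question's edit.

```php
<?php
// Hypothetical helper (the name body_inner_html is not from the original post).
// Returns the markup inside <body> after DOMDocument has parsed a fragment.
function body_inner_html(string $content): string
{
    // Normalize CRLF and lone CR to LF so carriage returns never reach the serializer.
    $content = str_replace(array("\r\n", "\r"), "\n", $content);

    // Collect libxml parse warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    // Assumption: the input is UTF-8. The meta tag is a common workaround that
    // tells the HTML parser so; it ends up in <head>, not in <body>.
    $dom->loadHTML(
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $content
    );

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $innerHTML = '';
    $nodeBody = $dom->getElementsByTagName('body')->item(0);
    if ($nodeBody === null) {
        return $innerHTML; // nothing was parsed into a body
    }
    foreach ($nodeBody->childNodes as $child) {
        // saveHTML() accepts a node argument since PHP 5.3.6.
        $innerHTML .= $dom->saveHTML($child);
    }
    return $innerHTML;
}

echo body_inner_html('<p><a href="#">this is a link</a></p>');
```

If the stored text still looks wrong after this, the encoding of the data source and the database connection are worth checking too, since the sketch only covers what happens inside DOMDocument.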

If you just want to repair a fragment, you can make use of the tidy library as well:

$html = tidy_repair_string($html, array('output-xhtml'=>1,'show-body-only'=>1));
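
A hedged sketch of how that call might be guarded on servers where the tidy extension may not be loaded. The function name repair_fragment is hypothetical, and passing 'utf8' as the encoding is an assumption about the input:

```php
<?php
// Hypothetical wrapper (the name repair_fragment is not from the original post).
// Returns the repaired body-only markup, or null if tidy is unavailable or fails.
function repair_fragment(string $html): ?string
{
    // The tidy extension is optional, so check before calling into it.
    if (!function_exists('tidy_repair_string')) {
        return null;
    }

    $config = array(
        'output-xhtml'   => true, // emit XHTML-style markup
        'show-body-only' => true, // drop the doctype, <html> and <body> wrappers
    );

    // Assumption: the input string is UTF-8.
    $repaired = tidy_repair_string($html, $config, 'utf8');

    return $repaired === false ? null : $repaired;
}

$fragment = repair_fragment('<p><a href="#">this is a link</a>');
if ($fragment === null) {
    echo "tidy is not available or could not repair the fragment\n";
} else {
    echo $fragment;
}
```

Returning null lets the caller fall back to another approach, such as the DOMDocument loop, when tidy is missing.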
hakre
  • got it thank you! I don't know how to incorporate `tidy_repair_string` into my code though... but the `innerHTML` simulation works perfectly! – Run Jul 27 '11 at 22:15
  • Just found out that the `innerHTML` simulation generates something strange. Please see my edit above. Thanks. – Run Jul 27 '11 at 22:26
  • That looks like an encoding issue on your end. Ensure you only pump UTF-8 encoded strings into DomDocument. And you could normalize line-breaks as well before. However you should read into the tidy library, it has years of experience and deals with encodings and line-breaks as well. – hakre Jul 27 '11 at 23:02
  • I use `tidy_repair_string()` to fix this issue `$fragment = tidy_repair_string($dom->saveHTML(), array('output-xhtml'=>1,'show-body-only'=>1)); return $fragment;` – Run Jul 27 '11 at 23:09
  • I have to make sure that the server has `php_tidy` turned on. This could be a problem on a live server, as some of them may not have it configured... – Run Jul 27 '11 at 23:11
Hakre already mentioned the show-body-only option to HTML Tidy, which is probably what you want.

P.S. Here's the Tidy config file used by MediaWiki for pretty much just this purpose.

Ilmari Karonen