Remove empty HTML from a document

Question

I need some help stripping empty tags in my HTML. There is a solution here:

But I can't use JS, and I know I should never use regular expressions to parse HTML.

I need to clean inputs with PHP, and I also need to get more than just empty tags.

I also need to catch tags like this:

<p> </p> (variable whitespace with nothing in the tag)
<p>&nbsp;</p>
<p><br/></p>
<p><br /></p>

What can I do to catch bad markup like that before it makes it to the database (WYSIWYGs)?

Is your input valid XHTML? If so, XSLT could be a solution for your case. — Vincent Biragnet, Dec 22 '11 at 15:58
How else do you clean input from WYSIWYGs in forms? Multiple str_replace for each case? — Kevin, Dec 22 '11 at 16:31
Why wouldn't it? It shouldn't enter empty tags like that. It messes up style on elements like p tags that might have margin bottom, for example. — Kevin, Dec 22 '11 at 17:43

Incognito · Accepted Answer · 2011-12-22T21:07:04.060

Parse it with a document object model (DOM) parser, check the text content of each node, and remove the nodes that fail your criteria (the node parses as a script tag, contains only whitespace, is an iframe, etc.).

Quite a lot of sample code in the comments section as well.

Here's a bunch of code that does something like that (adapted from random cut+paste on php.net):

<?php

$sampleHTML = "
<p>  </p>
<p> &nbsp;   <p>
<p><br/></p>
<p><br /></p>
<span>Non-empty span<p id='NestedEmptyElement'></p></span>
";

$doc = new DOMDocument();
$doc->loadHTML($sampleHTML);

// Collect first, remove later: the node list is live, so removing
// nodes while iterating over it would skip elements.
$domElemsToRemove = array();
foreach ($doc->getElementsByTagName('*') as $domElement) {
  $domElement->normalize();
  // "\xc2\xa0" is the UTF-8 encoding of &nbsp;. Note that elements
  // without any text, such as <br> or <img>, also count as empty here.
  if (trim($domElement->textContent, "\xc2\xa0 \n\r\t") == "") {
    $domElemsToRemove[] = $domElement;
  }
}

foreach ($domElemsToRemove as $domElement) {
  // Skip nodes that are no longer attached to a parent.
  // There's a better way to do this, it's recursive.
  if ($domElement->parentNode !== null) {
    $domElement->parentNode->removeChild($domElement);
  }
}

// Print only the children of <body>, not the wrapper that loadHTML() adds.
$body = $doc->getElementsByTagName('body')->item(0);
foreach ($body->childNodes as $childNode) {
  echo trim($childNode->C14N());
}

echo "\n\n";

Then we run it:

$ php foo.php -v
<span>Non-empty span</span>
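
The answer mentions that a recursive approach would be better. A minimal sketch of what that could look like follows; `pruneEmptyChildren` is a hypothetical helper name, not something from the original answer, and the sample markup is made up for illustration. It walks the tree once and removes an empty subtree in a single step, so no node is removed twice.

```php
<?php
// Hypothetical helper, not part of the original answer.
// Removes every descendant element of $parent whose text is empty or
// consists only of whitespace and non-breaking spaces.
function pruneEmptyChildren(DOMNode $parent)
{
    // Copy the child list first: childNodes is live and shrinks as nodes are removed.
    foreach (iterator_to_array($parent->childNodes) as $child) {
        if (!($child instanceof DOMElement)) {
            continue; // leave text nodes and comments alone
        }
        // textContent is UTF-8, so match U+00A0 (&nbsp;) explicitly.
        if (preg_match('/^[\s\x{00A0}]*$/u', $child->textContent) === 1) {
            $parent->removeChild($child); // drops the whole empty subtree at once
        } else {
            pruneEmptyChildren($child);   // has text somewhere: check its children
        }
    }
}

// Made-up sample input for illustration.
$doc = new DOMDocument();
$doc->loadHTML('<div><p> &nbsp; </p><p><br></p><p>Kept <b></b>text</p></div>');
$body = $doc->getElementsByTagName('body')->item(0);
pruneEmptyChildren($body);

foreach ($body->childNodes as $node) {
    echo $doc->saveHTML($node), "\n";
}
```

Like the code above, this treats any element without text as empty, so a paragraph that holds only an image would be removed too. If that matters for your input, the emptiness check would need an exception for such tags.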

fge · Answer 2 · score 2 · 2011-12-22T16:08:52.130

That matches your examples and a little more:

^<p>\s*(?:(?:&nbsp;|<br\s*/>)\s*)*</p>$

But are you looking only for p tags? Can there be several per line?

Yet another use of normal* (special normal*)* with:

normal: \s,
special: (&nbsp;|<br\s*/>)

(with non capturing groups)
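
As a rough illustration, here is how that pattern could be applied in PHP with `preg_match`. The variable names and the sample lines are assumptions made for this sketch. It only handles the case the answer asks about: one `<p>` element per line and no attributes.

```php
<?php
// The pattern from this answer, wrapped in ~ delimiters so the slashes need no escaping.
$pattern = '~^<p>\s*(?:(?:&nbsp;|<br\s*/>)\s*)*</p>$~';

// Made-up sample input: one tag per line, as the pattern expects.
$lines = array(
    '<p> </p>',
    '<p>&nbsp;</p>',
    '<p><br/></p>',
    '<p><br /></p>',
    '<p>text</p>',
);

// Keep only the lines that do NOT match the "empty paragraph" pattern.
$kept = array();
foreach ($lines as $line) {
    if (preg_match($pattern, $line) !== 1) {
        $kept[] = $line;
    }
}

print_r($kept);
```

Only `<p>text</p>` survives here. Anything outside that narrow shape (other tags, attributes, several paragraphs on one line) will not match, which is the usual argument for a DOM parser.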

edited Dec 22 '11 at 16:08

answered Dec 22 '11 at 15:56


score 0 · Answer 3 · answered Sep 24 '13 at 00:36

I worked on this for about a day and saw a lot of "don't use regex", which I agree with.

However, I had huge problems with DOMDocument messing with my HTML entities. I would carefully filter text so that all TM symbols were converted to HTML entities such as &trade;, but it would convert them back to the TM symbol.

I battled with preventing this behavior for some time. There were some hacks mentioned for this. After a day of battling I thought "why should I work so hard to hack it to work? It should just work.." then I wrote this function using simplehtmldom in like 10 minutes:

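// Requires the PHP Simple HTML DOM Parser library (simple_html_dom) to be loaded.
function stripEmptyTags($html){
    $dom = new simple_html_dom();
    $dom->load($html);
    foreach($dom->find("*") as $e) {
        // An element counts as empty if nothing is left once spaces and &nbsp; are removed.
        if( trim( str_replace( array(' ','&nbsp;'), "", $e->innertext )) == "" ) {
            $e->outertext = "";
        }
    }
    $dom->load($dom->save());
    return $dom->save();
}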

Why `$dom->load($dom->save());`? — Sarah Trees, Feb 13 '22 at 16:37
