0

I need some help stripping empty tags in my HTML. There is a solution here:

Remove empty tags using RegEx

But I can't use JS, and I should never use Regular expressions to parse HTML.

I need to clean inputs with PHP, and I also need to get more than just empty tags.

I also need to catch tags like this:

<p> </p> (variable whitespace with nothing in the tag)
<p>&nbsp;</p>
<p><br/></p>
<p><br /></p>

What can I do to catch bad markup like that, which comes from WYSIWYG editors, before it makes it to the database?

Kevin

3 Answers

4

Parse it with a document object model (DOM) parser, check the text content of each node, and remove the nodes that don't meet your criteria (the node parses as a script tag, contains only whitespace, is an iframe, etc).

The comments sections of the php.net DOM documentation have quite a lot of sample code as well.

Here's a bunch of code that does something like that (adapted from random cut+paste on php.net):

<?php

$sampleHTML = "
<p>  </p>
<p> &nbsp;   <p>
<p><br/></p>
<p><br /></p>
<span>Non-empty span<p id='NestedEmptyElement'></p></span>
";

$doc = new DOMDocument();
$doc->loadHTML($sampleHTML);

// Collect first, remove afterwards: removing while iterating the live node list skips elements.
$domElemsToRemove = array();
foreach ($doc->getElementsByTagName('*') as $domElement) {
  $domElement->normalize();
  // "Empty" means nothing but whitespace or non-breaking spaces (\xc2\xa0 is U+00A0 in UTF-8).
  if (trim($domElement->textContent, " \t\n\r\xc2\xa0") === "") {
    $domElemsToRemove[] = $domElement;
  }
}

foreach ($domElemsToRemove as $domElement) {
  // A nested empty element may sit inside a parent that was removed already; that is harmless.
  // There's a better way to do this, it's recursive.
  if ($domElement->parentNode !== null) {
    $domElement->parentNode->removeChild($domElement);
  }
}

// Print what is left inside <body>.
$body = $doc->getElementsByTagName('body')->item(0);
foreach ($body->childNodes as $childNode) {
  echo trim($childNode->C14N());
}

echo "\n\n";

Then we run it:

$ php foo.php -v
<span>Non-empty span</span>
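
The code comment above mentions that a recursive approach would be better. Here is a minimal sketch of what that could look like, not a definitive implementation. The function names `pruneEmptyElements` and `stripEmptyElements` and the `$keepTags` list are hypothetical and not part of the original answer. It assumes you want to keep elements such as `<img>` that carry no text, and to leave `<br>` alone unless its whole parent is empty. No charset hint is given to `loadHTML`, so non-ASCII input may need extra handling.

```php
<?php

/**
 * Hypothetical helper: walk the tree depth-first and drop elements that
 * contain only whitespace, &nbsp; or <br>, unless they are (or contain) a kept tag.
 */
function pruneEmptyElements(DOMNode $node, array $keepTags = array('img', 'hr')): void
{
    // Copy the live child list first; removing nodes while iterating it skips siblings.
    foreach (iterator_to_array($node->childNodes) as $child) {
        if (!($child instanceof DOMElement)) {
            continue;
        }

        // Children first, so a parent is judged after its own subtree was cleaned.
        pruneEmptyElements($child, $keepTags);

        $name = strtolower($child->nodeName);
        if ($name === 'br' || in_array($name, $keepTags, true)) {
            continue; // never removed on its own
        }

        foreach ($keepTags as $tag) {
            if ($child->getElementsByTagName($tag)->length > 0) {
                continue 2; // contains something worth keeping
            }
        }

        // \xc2\xa0 is the UTF-8 encoding of U+00A0, which is what &nbsp; becomes after parsing.
        if (trim($child->textContent, " \t\n\r\xc2\xa0") === '') {
            $node->removeChild($child);
        }
    }
}

/** Hypothetical wrapper: clean an HTML fragment and return what is left inside <body>. */
function stripEmptyElements(string $html): string
{
    if (trim($html) === '') {
        return '';
    }

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // keep parser warnings out of the output
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    pruneEmptyElements($body);

    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return trim($out);
}

echo stripEmptyElements('<p> </p><p>&nbsp;</p><p><br/></p><div>Hello</div>'), "\n";
```

Because each parent is checked after its children, a wrapper that only held empty elements is removed as well, while a paragraph holding an image or real text with a line break is left untouched.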
Incognito
2

This regex matches your examples and a little more:

^<p>\s*(?:(?:&nbsp;|<br\s*/>)\s*)*</p>$

But are you looking only for p tags? Can there be several per line?

This is yet another use of the normal* (special normal*)* pattern, with:

  • normal: \s,
  • special: (&nbsp;|<br\s*/>)

(with non capturing groups)
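
To show how the pattern above behaves in PHP, here is a small sketch. The helper names `isEmptyParagraph` and `stripEmptyParagraphs` are hypothetical. The second helper drops the `^` and `$` anchors so that several paragraphs in one string can be removed, which is an assumption about what is wanted and not something the answer states.

```php
<?php

// Pattern copied from the answer; '~' is used as the delimiter so '/' needs no escaping.
const EMPTY_P_ANCHORED = '~^<p>\s*(?:(?:&nbsp;|<br\s*/>)\s*)*</p>$~';

// Hypothetical helper: true when the whole string is one "empty" paragraph.
function isEmptyParagraph(string $html): bool
{
    return preg_match(EMPTY_P_ANCHORED, $html) === 1;
}

// Hypothetical helper: the same pattern without anchors, applied everywhere in the string.
function stripEmptyParagraphs(string $html): string
{
    return (string) preg_replace('~<p>\s*(?:(?:&nbsp;|<br\s*/>)\s*)*</p>~', '', $html);
}

$samples = array('<p> </p>', '<p>&nbsp;</p>', '<p><br/></p>', '<p><br /></p>', '<p>text</p>');
foreach ($samples as $sample) {
    echo isEmptyParagraph($sample) ? 'empty' : 'kept', ': ', $sample, "\n";
}
```

This only covers plain `<p>` tags without attributes, so it is a narrower check than the DOM-based approach.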

fge
0

I worked on this for about a day and saw a lot of "don't use regex" advice, which I agree with.

However, I had huge problems with DOMDocument messing with my HTML entities. I would carefully filter text so that all TM symbols were converted to HTML entities such as &trade;, but it would convert them back to the TM symbol.

I battled with preventing this behavior for some time, and there were some hacks mentioned for it. After a day of battling I thought, "Why should I work so hard to hack it to work? It should just work." Then I wrote this function using simplehtmldom in about 10 minutes:

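// Assumes the third-party Simple HTML DOM library is available; adjust the path to your copy.
require_once 'simple_html_dom.php';

function stripEmptyTags($html){
    $dom = new simple_html_dom();
    $dom->load($html);
    foreach ($dom->find("*") as $e) {
        // Treat a tag as empty when nothing is left after dropping spaces and &nbsp; entities.
        if (trim(str_replace(array(' ', '&nbsp;'), "", $e->innertext)) === "") {
            $e->outertext = "";
        }
    }
    // Re-parse the modified markup, then serialize it.
    $dom->load($dom->save());
    return $dom->save();
}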
JaseC