I have the following regex:

$html = '<p></p><p>Lorem ispum...</p><p>  </p><p>;nbsp</p>';
$pattern = "/<p[^>]*><\\/p[^>]*>/";
echo preg_replace($pattern, '', $html );

This only removes the <p> tag if it's actually empty, i.e. <p></p>. How do I remove it if it has some other invisible copy in it, such as &nbsp;?

Sinister Beard
Hai Truong IT

2 Answers

There are several possible kinds of whitespace and even more possibilities for "empty" (e.g., is <p><em></em></p> empty? Or not?).

Also consider the possibility of having <p class="para"> or <p id="chief">.

Much depends on where the text comes from. Microsoft Word will output &#160; in some circumstances (I no longer remember exactly which, sorry).

A reasonable possibility for now might be to use a regex such as '#<p>(\\s|&nbsp;)*</p>#mis' to match paragraphs that contain only whitespace or &nbsp;, even when the whitespace spans several lines.
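To make the idea concrete, here is a minimal sketch of that pattern in use. The &#160; and literal U+00A0 alternatives are my own additions, prompted by the Word remark above, and the sample input is invented for illustration. The `m` and `s` flags are left out because they only affect `^`, `$` and `.`, none of which appear in the pattern.

```php
<?php
// Invented sample: empty, multi-line whitespace, &nbsp;, &#160; and a literal
// non-breaking space (U+00A0), plus one paragraph with real text.
$html = "<p></p><p>Lorem ipsum</p><p> \n </p><p>&nbsp;</p><p>&#160;</p><p>\u{00A0}</p>";

// "~" is the delimiter here because the pattern itself contains "#".
// The u flag makes \x{00A0} match the UTF-8 encoded non-breaking space.
$pattern = '~<p>(?:\s|&nbsp;|&#160;|\x{00A0})*</p>~iu';

$clean = preg_replace($pattern, '', $html);
echo $clean, PHP_EOL;
```

Note that this version still only matches a bare <p> with no attributes.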

But keep in mind that this kind of requirement tends to become unreasonable quickly: for example, allowing attributes such as class forces you to use '#<p[^>]*>(\\s|&nbsp;)*</p>#mis', and so on. You might want to start looking into an XML parser instead.
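As a rough sketch of the parser route, the following uses PHP's bundled DOMDocument (an HTML-tolerant parser rather than a strict XML one). The function name `removeEmptyParagraphs` is hypothetical, and the rule it applies (no child elements, text made only of whitespace or non-breaking spaces) is one possible definition of "empty", so <p><em></em></p> is deliberately kept. It assumes the input is ASCII plus entities; other text would need a charset hint before loading.

```php
<?php
// Hypothetical helper (not from the answer): drops <p> elements that contain
// no child elements and whose text is only whitespace or non-breaking spaces.
function removeEmptyParagraphs(string $html): string
{
    if (trim($html) === '') {
        return '';
    }

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence warnings on sloppy markup
    $doc->loadHTML($html);                        // libxml adds the implied <html><body>
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Copy the live node list into an array so removals do not disturb iteration.
    foreach (iterator_to_array($doc->getElementsByTagName('p')) as $p) {
        $hasElements = $p->getElementsByTagName('*')->length > 0;
        // loadHTML() decodes &nbsp; and &#160; to U+00A0, so one check covers both.
        $isBlank = preg_match('/^[\s\x{00A0}]*$/u', $p->textContent) === 1;
        if (!$hasElements && $isBlank) {
            $p->parentNode->removeChild($p);
        }
    }

    // Serialize only the children of <body>, not the implied wrapper elements.
    $body = $doc->getElementsByTagName('body')->item(0);
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo removeEmptyParagraphs('<p></p><p>Lorem ipsum</p><p class="para">&nbsp;</p><p>&#160; </p>'), PHP_EOL;
```

Attributes such as class or id need no special handling here, which is the main advantage over growing the regex.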

LSerni

I assume that by backspace you mean whitespace, and that ;nbsp& should be &nbsp;, so I propose:

$pattern = "/<p[^>]*>(\s|&nbsp;)*<\\/p[^>]*>/";

\s matches any whitespace character.

The pattern matches \s OR (|) &nbsp; ANY (*) number of times between the opening and closing <p> tags.
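For completeness, here is the proposed pattern run against the sample from the question, with the entity spelled &nbsp; as assumed above. The pattern is written in a single-quoted string so that \s and \/ reach the regex engine unchanged; `$result` is just an illustrative variable name.

```php
<?php
// The question's sample, with the entity written as &nbsp;.
$html = '<p></p><p>Lorem ipsum...</p><p>  </p><p>&nbsp;</p>';

// <p[^>]*> allows attributes on the opening tag;
// (\s|&nbsp;)* allows any mix of whitespace and &nbsp; inside it.
$pattern = '/<p[^>]*>(\s|&nbsp;)*<\/p[^>]*>/';

$result = preg_replace($pattern, '', $html);
echo $result, PHP_EOL;
```

One caveat: <p[^>]*> also matches tags that merely start with "p", such as <pre>, so writing <p\b[^>]*> may be a safer variant.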