I have a string like the one below (an empty paragraph), saved from heavily edited and post-processed TinyMCE input.

This is how it looks after echo, in the HTML source in the browser:

<p> </p>

Now, I need to remove those empty paragraphs.

I have already tried

$output = str_ireplace("<p> </p>", "", $string);
$output = preg_replace("/<p> <\/p>/", "", $string);
$output = preg_replace("/<p>[ \t\n\r]*<\/p>/", "", $string);
$output = preg_replace("/<p>[\s]*<\/p>/", "", $string);

and many more variations, with no luck. The paragraph is still there, intact. I have also tried mb_ereg_replace and matching the &nbsp; entity, but apparently that isn't what the string contains.

On the other hand, this works:

$output = preg_replace("/<p>.*<\/p>/", "", $string);

but of course it also strips paragraphs with actual content.

What else could that "space-like" character be? How am I supposed to match it?

SOLVED: Thanks to ibizaman and this thread, I've found the character. It is a non-breaking space (nbsp), Unicode code point U+00A0 (decimal 160). See http://unicodelookup.com/#160/1

This works:

$output = preg_replace("/<p>[\x{00A0}\s]*<\/p>/u", "", $string);
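A small sketch of why the earlier attempts failed, assuming the string is UTF-8 encoded, so the non-breaking space is stored as the two bytes 0xC2 0xA0. The sample value of $string is made up for illustration, and the byte-wise behaviour described in the comments assumes the default C locale:

```php
<?php
// Hypothetical sample: an "empty" paragraph holding a UTF-8 non-breaking
// space (bytes 0xC2 0xA0), followed by a paragraph with real content.
$string = "<p>\xC2\xA0</p><p>Hello</p>";

// Without the u modifier the subject is matched byte by byte. Neither 0xC2
// nor 0xA0 is treated as whitespace by \s here, so nothing is removed.
$bytewise = preg_replace('/<p>\s*<\/p>/', '', $string);

// With the u modifier, \x{00A0} refers to the code point U+00A0, so the
// paragraph that only holds a non-breaking space is matched and removed.
$unicode = preg_replace('/<p>[\x{00A0}\s]*<\/p>/u', '', $string);

var_dump($bytewise === $string);        // is the empty paragraph still there?
var_dump($unicode === '<p>Hello</p>');  // is only the content paragraph left?
```

The patterns are single-quoted here so that PHP passes the backslash sequences to the regex engine unchanged.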

As pointed out by mcrumley, this might work even better:

"/<p>[\p{Zs}\s]*<\/p>/iu"
Saix
  • Are you sure it's not your browser simply adding the space in the HTML source for display purposes? What if you save the page and view it in a text editor such as [Notepad++](http://notepad-plus-plus.org/)? – Mike Nov 20 '13 at 13:21
  • "/<p>[^a-zA-Z0-9]*<\/p>/" should do it, although it's maybe too restrictive. The rationale is that a ^ at the beginning of the brackets negates the character class. – ibizaman Nov 20 '13 at 13:23
  • What about `<p>[^<]*<\/p>`... Anyway, check the page source to be sure... I remember last time, a similar situation drove me crazy :S – Enissay Nov 20 '13 at 13:26
  • Using `#<p>\s*</p>#` should work. What's the exact output? Could you give the hex value of those spaces? A wild guess: try the `u` modifier, `#<p>\s*</p>#u` – HamZa Nov 20 '13 at 13:26
  • @Mike: it looks like a normal space after saving and opening in PsPad, for example; I also have those `<p></p>` in the string and those are easily removed – Saix Nov 20 '13 at 13:34
  • @Enissay: it also removes paragraphs with content – Saix Nov 20 '13 at 13:35
  • @HamZa: neither one works – Saix Nov 20 '13 at 13:37
  • @ibizaman: "/<p>[^a-zA-Z0-9]*<\/p>/" is a good idea, it works as a nice workaround, but I might have to enhance it a bit, thanks. At least something. – Saix Nov 20 '13 at 13:38
  • @Saix yes indeed, that's why I said it's too restrictive. Try parsing your string with functions that output Unicode values (see [this](http://stackoverflow.com/questions/9361303/can-i-get-the-unicode-value-of-a-character-or-vise-versa-with-php) SO question). Then you'll see what's really going on. – ibizaman Nov 20 '13 at 13:40
  • @ibizaman: I've found the bastard... [link](http://unicodelookup.com/#160/1) – Saix Nov 20 '13 at 14:18
  • Your character class is not doing exactly what you think it is doing. `[\x{00a0}|\s]` matches non-breaking space, white space, and the pipe character "|". You can take out the "|". – mcrumley Nov 20 '13 at 18:41

3 Answers

You can use the Unicode character property to match all spaces. \p{Zs} is "Space separator" and includes space, non-breaking space, thin space, etc. You can also use \pZ to match all separators, including line separator and paragraph separator. See http://www.php.net/manual/en/regexp.reference.unicode.php for details.

$output = preg_replace("/<p>[\p{Zs}\s]*<\/p>/iu", "", $string);
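As a rough illustration of the difference \p{Zs} makes, the sketch below runs the same pattern over a made-up sample string (not from the original post) that contains a paragraph with only a non-breaking space, a paragraph with a thin space plus a newline, and a paragraph with real text. It assumes the input is valid UTF-8, which the u modifier requires:

```php
<?php
// Made-up sample: a paragraph with only a non-breaking space (U+00A0), one
// with a thin space (U+2009) plus a newline, and one with real content.
$string = "<p>\xC2\xA0</p><p>\xE2\x80\x89\n</p><p>Hello</p>";

// \p{Zs} covers the Unicode space separators, \s covers newlines and tabs;
// the u modifier makes PCRE read the subject as UTF-8 code points.
$output = preg_replace('/<p>[\p{Zs}\s]*<\/p>/iu', '', $string);

echo $output, "\n"; // prints <p>Hello</p>
```

If the input is not valid UTF-8, preg_replace returns null with the u modifier, so checking the return value before using it may be worthwhile.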
mcrumley

Since you don't know which character is being output, first inspect $string with functions that print Unicode values (see [this SO question](http://stackoverflow.com/questions/9361303/can-i-get-the-unicode-value-of-a-character-or-vise-versa-with-php)).
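One way to inspect the string, as a minimal sketch: the helper name dump_chars is hypothetical (it is not from the linked question), and it assumes the string is UTF-8. It splits the string into characters and shows the hex bytes of each one, so an unexpected multi-byte character stands out:

```php
<?php
// Hypothetical helper (not from the original post): split a UTF-8 string
// into characters and return the hex bytes of each character.
function dump_chars(string $s): array
{
    // With the u modifier, the empty pattern splits between code points.
    $chars = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        // Not valid UTF-8: fall back to one entry per byte.
        $chars = str_split($s);
    }
    return array_map('bin2hex', $chars);
}

// Sample input: a paragraph holding a UTF-8 non-breaking space.
echo implode(' ', dump_chars("<p>\xC2\xA0</p>")), "\n";
// prints: 3c 70 3e c2a0 3c 2f 70 3e
```

An ordinary space would show up as 20; the entry c2a0 is the UTF-8 encoding of U+00A0, the non-breaking space.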

Or, you can proceed the other way around and remove every paragraph that contains no letters or digits, whatever the space-like character is:

$output = preg_replace("/<p>[^a-zA-Z0-9]*<\/p>/", "", $string);
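Replacing every paragraph that has no ASCII letters or digits with an empty string can be sketched as follows. The sample string is made up for illustration; it also includes a punctuation-only paragraph to show why this approach may be too restrictive:

```php
<?php
// Made-up sample: a paragraph with only a non-breaking space, a paragraph
// with only punctuation, and a paragraph with real text.
$string = "<p>\xC2\xA0</p><p>...</p><p>Hello</p>";

// Any paragraph without an ASCII letter or digit is removed, regardless of
// which bytes it actually contains.
$output = preg_replace('/<p>[^a-zA-Z0-9]*<\/p>/', '', $string);

echo $output, "\n"; // prints <p>Hello</p>
```

Note that the punctuation-only paragraph is removed as well, and so would be a paragraph consisting only of non-ASCII letters, since the pattern works on bytes.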

Disclaimer: I already put this in the comments, but since it solved the problem, I think it's better placed in an answer for future reference.

ibizaman

A 'space-like character' is \s, which would make your entire line

$output = preg_replace("/<p>\s*<\/p>/", "", $string);

See an example on regex101.com.

SQB
  • He already tried that, `<p>[\s]*<\/p>`, and no, you don't need to escape the backslash in this case. – HamZa Nov 20 '13 at 13:20
  • Just to be sure, I have tried this one too. Doesn't work either. – Saix Nov 20 '13 at 13:23
  • Well, it _does_ work. I'll update my answer with a link to regex101.com – SQB Nov 20 '13 at 13:24
  • @SQB well of course it would work for "normal spaces" but the OP surely doesn't have normal spaces `0x20` or there is another problem in the code (logic). – HamZa Nov 20 '13 at 13:29