Match special kind of whitespace

Question

I have a string like this (it's an empty paragraph), saved from heavily edited and post-processed TinyMCE input.

This is how it looks after echo, in the HTML source in the browser:

<p> </p>

Now, I need to remove those empty paragraphs.

I have already tried

$output = str_ireplace("<p> </p>", "", $string);
$output = preg_replace("/<p> <\/p>/", "", $string);
$output = preg_replace("/<p>[ \t\n\r]*<\/p>/", "", $string);
$output = preg_replace("/<p>[\s]*<\/p>/", "", $string);

and many more variations, with no luck. It's still there, intact. I have also tried mb_ereg_replace and matching &nbsp;, which apparently isn't the case.

On the other hand, this works:

$output = preg_replace("/<p>.*<\/p>/", "", $string);

but of course it also strips paragraphs with actual content.

What else could that "space-like" character be? How am I supposed to match it?

SOLVED: Thanks to ibizaman and the linked thread, I've found the character. It is a non-breaking space (nbsp), Unicode code point U+00A0 (decimal 160). See http://unicodelookup.com/#160/1

This works:

$output = preg_replace("/<p>[\x{00A0}\s]*<\/p>/u", "", $string);

As pointed out by mcrumley, this might work even better:

"/<p>[\p{Zs}\s]*<\/p>/iu"

Are you sure it's not your browser simply adding the space in the HTML source for display purposes? What if you save the page and view it in a text editor such as [Notepad++](http://notepad-plus-plus.org/)? — Mike, Nov 20 '13 at 13:21
"/<p>[^a-zA-Z0-9]*<\/p>/" should do it, although it's maybe too restrictive. The rationale is that a ^ at the beginning of the brackets negates it. — ibizaman, Nov 20 '13 at 13:23
What about `<p>[^<]*<\/p>`... Anyway, check the page source to be sure... I remember last time, a similar situation made me crazy :S — Enissay, Nov 20 '13 at 13:26
Using `#<p>\s*</p>#` should work. What's the exact output? Could you give the hex value of those spaces? A wild guess: try to use the `u` modifier, `#<p>\s*</p>#u` — HamZa, Nov 20 '13 at 13:26
@Mike: it looks like a normal space after saving and opening in PsPad for example; I have also those
in the string and those are easily removed — Saix, Nov 20 '13 at 13:34
@ibizaman: "/<p>[^a-zA-Z0-9]*<\/p>/" is a good idea, it works as a nice workaround, but I might have to enhance it a bit, thanks. At least something. — Saix, Nov 20 '13 at 13:38
@Saix yes indeed, that's why I said it's too restrictive. Try parsing your string with functions outputting Unicode values (see [this](http://stackoverflow.com/questions/9361303/can-i-get-the-unicode-value-of-a-character-or-vise-versa-with-php) SO question). Then you'll see what's really going on. — ibizaman, Nov 20 '13 at 13:40
@ibizaman: I've found the bastard... [link](http://unicodelookup.com/#160/1) — Saix, Nov 20 '13 at 14:18
Your character class is not doing exactly what you think it is doing. `[\x{00a0}|\s]` matches non-breaking space, white space, and the pipe character "|". You can take out the "|". — mcrumley, Nov 20 '13 at 18:41
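
mcrumley's point about the pipe can be checked in isolation. This is only an illustrative sketch: the anchored patterns and the test strings are made up for the demonstration, and the non-breaking space is assumed to be UTF-8 encoded (bytes C2 A0).

```php
<?php
// Inside [...] a pipe is a literal character, not alternation.
$withPipe    = '/^[\x{00a0}|\s]+$/u';
$withoutPipe = '/^[\x{00a0}\s]+$/u';

// Assumed sample: a UTF-8 non-breaking space followed by a plain space.
$nbspAndSpace = "\xC2\xA0 ";

$pipeAccepted    = preg_match($withPipe, '|');              // class contains "|"
$pipeRejected    = preg_match($withoutPipe, '|');           // class has no "|"
$spacesStillWork = preg_match($withoutPipe, $nbspAndSpace); // nbsp + space

var_dump($pipeAccepted);    // int(1)
var_dump($pipeRejected);    // int(0)
var_dump($spacesStillWork); // int(1)
```

Dropping the pipe loses nothing here, because every member of a character class is already an alternative.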

mcrumley · Answer 1 · 2013-11-20T18:45:23.813

3

You can use the Unicode character property to match all spaces. \p{Zs} is "Space separator" and includes space, non-breaking space, thin space, etc. You can also use \pZ to match all separators, including line separator and paragraph separator. See http://www.php.net/manual/en/regexp.reference.unicode.php for details.

$output = preg_replace("/<p>[\p{Zs}\s]*<\/p>/iu", "", $string);
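
As a quick check of the pattern above, here is a minimal sketch run against a made-up input. The sample string is an assumption for illustration and uses UTF-8 byte escapes, so the invisible characters are explicit in the source.

```php
<?php
// Hypothetical input: paragraphs holding different "space-like" characters.
$string = "<p>\xC2\xA0</p>"            // U+00A0 no-break space (Zs)
        . "<p>\xE2\x80\x89</p>"        // U+2009 thin space (Zs)
        . "<p> \t\n</p>"               // ASCII space, tab, newline (matched by \s)
        . "<p>Real\xC2\xA0content</p>"; // must survive

// Same pattern as in the answer; single quotes keep the backslashes literal.
$output = preg_replace('/<p>[\p{Zs}\s]*<\/p>/iu', '', $string);

var_dump($output === "<p>Real\xC2\xA0content</p>"); // bool(true)
```

Only the paragraph with real text is left; the non-breaking space inside it is untouched because the pattern requires the whole paragraph body to be whitespace.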

edited Nov 20 '13 at 18:45

answered Nov 20 '13 at 18:38

mcrumley


score 2 · Accepted Answer · edited May 23 '17 at 11:56

Since you don't know which character is being output, first inspect $string with functions that output Unicode values (see this SO question: http://stackoverflow.com/questions/9361303/can-i-get-the-unicode-value-of-a-character-or-vise-versa-with-php).
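
One way to do that inspection, as a minimal sketch: it assumes PHP 7+ and a UTF-8 string, and `hexPerChar` is a hypothetical helper name, not something from the linked question. It prints the bytes of each character, so anything that is not a plain space (hex 20) stands out.

```php
<?php
// Hypothetical sample: an "empty" paragraph holding a UTF-8 non-breaking space.
$string = "<p>\xC2\xA0</p>";

// Hypothetical helper: hex bytes of each UTF-8 character, space separated.
function hexPerChar(string $s): string
{
    // Splitting on the empty pattern with the u modifier yields whole characters.
    $chars = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return ''; // input was not valid UTF-8
    }
    return implode(' ', array_map('bin2hex', $chars));
}

echo bin2hex($string), "\n";    // 3c703ec2a03c2f703e
echo hexPerChar($string), "\n"; // 3c 70 3e c2a0 3c 2f 70 3e
```

The `c2a0` group is the UTF-8 encoding of U+00A0; an ordinary space would appear as `20`.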

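Or you can go the other way around and remove every paragraph that contains no ASCII letters or digits:

// The replacement must be an empty string; "\1" in double quotes is the control character 0x01 in PHP.
$output = preg_replace("/<p>[^a-zA-Z0-9]*<\/p>/", "", $string);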

Disclaimer: I already put this in the comments, but since it solved the problem, I think it's better placed in an answer for future reference.
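
To see why the `[^a-zA-Z0-9]*` workaround was called too restrictive in the comments, here is a small sketch with made-up inputs. The sample strings are assumptions for illustration; the pattern is run without the `u` modifier, as in the answer, so it works byte by byte.

```php
<?php
// Pattern from the workaround: nothing but non-alphanumeric bytes between the tags.
$pattern = '/<p>[^a-zA-Z0-9]*<\/p>/';

// Hypothetical inputs, chosen only to show the side effects.
$samples = [
    'nbsp only'        => "<p>\xC2\xA0</p>",
    'punctuation only' => "<p>...</p>",
    'accented letter'  => "<p>\xC3\xA9</p>", // UTF-8 "é", no ASCII letters
    'ASCII text'       => "<p>Hi</p>",
];

$results = [];
foreach ($samples as $label => $html) {
    $results[$label] = preg_replace($pattern, '', $html);
    printf("%-16s => %s\n", $label, var_export($results[$label], true));
}
```

The first three paragraphs are all removed and only `<p>Hi</p>` is kept, so paragraphs made of punctuation or non-ASCII text are lost along with the empty ones.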

SQB · Answer 3 · 2013-11-20T13:35:15.400

0

A 'space-like character' is \s, which would make your entire line:

$output = preg_replace("/<p>\s*<\/p>/", "", $string);

See an example on regex101.com.

edited Nov 20 '13 at 13:35

answered Nov 20 '13 at 13:19

SQB



He already tried that `<p>[\s]*<\/p>`, and no you don't need to escape the backslash in this case. – HamZa Nov 20 '13 at 13:20
Just to be sure, I have tried this one too. Doesn't work either. – Saix Nov 20 '13 at 13:23
Well, it _does_ work. I'll update my answer with a link to regex101.com – SQB Nov 20 '13 at 13:24
@SQB well of course it would work for "normal spaces" but the OP surely doesn't have normal spaces `0x20` or there is another problem in the code (logic). – HamZa Nov 20 '13 at 13:29
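
The disagreement in these comments can be reproduced with a short sketch. It assumes a default PHP setup and that the non-breaking space is stored as UTF-8 (bytes C2 A0); both sample strings are made up for illustration.

```php
<?php
$plain = "<p> </p>";         // U+0020, an ordinary space
$nbsp  = "<p>\xC2\xA0</p>";  // U+00A0 as UTF-8 bytes (assumed encoding)

// No u modifier: the subject is treated as raw bytes.
$pattern = '/<p>\s*<\/p>/';

$plainResult = preg_replace($pattern, '', $plain);
$nbspResult  = preg_replace($pattern, '', $nbsp);

var_dump($plainResult === '');   // the ordinary-space paragraph is removed
var_dump($nbspResult === $nbsp); // the nbsp paragraph is left intact
```

So the pattern does work for ordinary spaces, as SQB says, but not for the character the question is actually about.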
