6

I was scouring SO answers and found that the most common solution for collapsing multiple spaces is:

$new_str = preg_replace("/\s+/", " ", $str);

But in many cases the whitespace in a string also includes Unicode characters such as the non-breaking space, in addition to line feed, form feed, carriage return, etc. This wiki shows that Unicode defines twenty-five characters as whitespace.

So how do we replace all these characters as well using regular expressions?

Adam Ranganathan

3 Answers

11

When the u modifier is passed, \s becomes Unicode-aware. So, a simple solution is to use

$new_str = preg_replace("/\s+/u", " ", $str);
                             ^^

See the PHP online demo.
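
As a minimal sketch of the same idea, the string from the comments below can be run through the pattern directly. It assumes PHP 7.0+ (for the "\u{...}" escape), UTF-8 input, and a PCRE build with Unicode support; the variable names are only illustrative.

```php
<?php
// Ten UTF-8 non-breaking spaces (U+00A0) between the words.
$str = "Hello there" . str_repeat("\u{00A0}", 10) . "Bob!";

// With the u modifier, \s also matches Unicode whitespace such as U+00A0.
$new_str = preg_replace('/\s+/u', ' ', $str);

echo $new_str, PHP_EOL;
```

If preg_replace returns NULL here instead of a string, the input is most likely not valid UTF-8, which is worth checking before blaming the pattern.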

Graham
Wiktor Stribiżew
  • Yes, it becomes aware but it does not replace spaces like non-breaking spaces. We need to specify those characters specifically. For instance, try your solution with this string: `$str = "Hello there".str_repeat(json_decode('"\u00A0"'),10)."Bob!";` The string has 10 spaces which are non-breaking represented by utf code `00A0`. You can try echo that string first to see what it does. – Adam Ranganathan Oct 26 '16 at 14:45
  • Non-breaking space is in my sample text. It *is* replaced. And here is [a demo](http://ideone.com/HkGNLh) with your example string above. Also leaving just 1 space inside. – Wiktor Stribiżew Oct 26 '16 at 14:46
  • I tried this code based on yours, but I am not getting the result. Am I missing something? `$utf = "Hello there".str_repeat(json_decode('"\u00A0"'),10)."Bob!"; $new_str = preg_replace("/\s+/u", " ", $utf); echo $new_str;` – Adam Ranganathan Oct 26 '16 at 14:49
  • Again, see http://ideone.com/I8qnpV. You should check if your environment is set to work with UTF correctly. – Wiktor Stribiżew Oct 26 '16 at 14:50
  • Ok that's strange. Would you know what could possibly be wrong with my environment? The solution I have posted works in my present environment. It detects the UTF chars. But your solution is not giving the same result. Any idea where I could read more on this? – Adam Ranganathan Oct 26 '16 at 14:54
  • There are various reasons. Please check http://stackoverflow.com/questions/1605760/how-to-best-configure-php-to-handle-a-utf-8-website first. – Wiktor Stribiżew Oct 26 '16 at 14:56
1

The first thing to do is to read this explanation of how Unicode is handled in regular expressions. For PHP specifically, we first need to add the PCRE modifier 'u' so that the engine treats the pattern and the subject as UTF-8. So this would be:

$pattern = "/<our-pattern-here>/u";

The next thing to note is that in a PHP (PCRE) pattern a Unicode character is written as \x{00A0}, where 00A0 is the hexadecimal code point, here the non-breaking space. So if we want to replace consecutive non-breaking spaces with a single space we would have:

$pattern = "/\x{00A0}+/u";
$new_str = preg_replace($pattern," ",$str);
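
A short, self-contained check of this pattern might look like the following. It is only a sketch: the sample string is made up for illustration, and the "\u{...}" escape assumes PHP 7.0+.

```php
<?php
// Three consecutive non-breaking spaces (U+00A0) between two words.
$str = "Hello\u{00A0}\u{00A0}\u{00A0}world";

// Single quotes keep \x{00A0} intact so that PCRE, not PHP, interprets it.
$pattern = '/\x{00A0}+/u';
$new_str = preg_replace($pattern, ' ', $str);

var_dump($new_str === 'Hello world');
```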

If we also want to include other whitespace characters mentioned in the wiki, such as:

  • \x{000D} carriage return
  • \x{000C} form feed
  • \x{0085} next line

Our pattern becomes:

$pattern = "/[\x{00A0}\x{000D}\x{000C}\x{0085}]+/u";

But this is not ideal: the class does not contain the ordinary space, so a run that mixes these characters with normal spaces is not collapsed to a single space. (The character class followed by + is not a performance problem in itself; the engine consumes each run in one pass.)

One way to handle mixed runs is to first replace every occurrence of each of these characters with a normal space, and then replace multiple spaces with a single normal space. We remove the [ ]+ and instead separate the characters with the alternation operator | :

$pattern = "/\x{00A0}|\x{000D}|\x{000C}|\x{0085}/u";
$new_str = preg_replace($pattern," ",$str); // we have one-to-one replacement of character by a normal space, so 5 unicode chars give 5 normal spaces
$final_str = preg_replace("/\s+/", " ", $new_str); // multiple normal spaces now become single normal space
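
The two steps can be wrapped in a small helper, sketched below. The function name collapse_spaces is hypothetical, and the NULL check is an added assumption about how one might want to handle input that is not valid UTF-8 (in which case preg_replace with the u modifier returns NULL).

```php
<?php
// Hypothetical helper: maps the listed Unicode characters to plain spaces,
// then collapses every whitespace run into one space.
function collapse_spaces(string $str): string
{
    // Step 1: one-to-one replacement of each listed character by a normal space.
    $spaced = preg_replace('/\x{00A0}|\x{000D}|\x{000C}|\x{0085}/u', ' ', $str);
    if ($spaced === null) {
        // With the u modifier, preg_replace returns NULL on invalid UTF-8.
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    // Step 2: runs of ordinary whitespace become a single normal space.
    return preg_replace('/\s+/', ' ', $spaced);
}

echo collapse_spaces("a\u{00A0} \u{00A0}b\r\n\u{0085}c"), PHP_EOL;
```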
Adam Ranganathan
1

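A pattern that matches all Unicode whitespace is [\pZ\pC]. Note that it is a superset: \pZ covers the separator categories (space, line, paragraph) and \pC covers the "other" categories (control, format, unassigned, etc.), so it also matches invisible characters that are not whitespace. Here is a unit test to prove it.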

If you're parsing user input in UTF-8 and need to normalize it, it's important to base your match on that list. So to answer your question that would be:

$new_str = preg_replace("/[\pZ\pC]+/u", " ", $str);
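
To illustrate what this pattern matches, here is a small sketch with a made-up sample string (PHP 7.0+ assumed for the "\u{...}" escape). It mixes a non-breaking space, a tab, an em space (U+2003) and a zero width space (U+200B); the last one is a format character rather than whitespace, but it falls under \pC and is replaced as well.

```php
<?php
// NBSP (Zs), tab (Cc), em space (Zs), then a zero width space (Cf).
$str = "a\u{00A0}\t\u{2003}b\u{200B}c";

// \pZ = separators, \pC = control/format/other; u enables UTF-8 mode.
$new_str = preg_replace('/[\pZ\pC]+/u', ' ', $str);

var_dump($new_str === 'a b c');
```

Whether replacing format characters such as U+200B is desirable depends on the input being normalized, so it is worth deciding that before adopting the broader class.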