non-breaking utf-8 0xc2a0 space and preg_replace strange behaviour

Question

My string contains a UTF-8 non-breaking space (bytes 0xC2 0xA0) and I want to replace it with something else.

When I use

$str=preg_replace('~\xc2\xa0~', 'X', $str);

it works OK.

But when I use

$str=preg_replace('~\x{C2A0}~siu', 'W', $str);

the non-breaking space is not found (and so not replaced).

Why? What is wrong with the second regexp?

The format \x{C2A0} is correct, and I also used the u flag.

I came here looking for $str=preg_replace('~\xc2\xa0~', 'X', $str); It's the first time a question has answered my question. – Gearóid Ó Ceallaigh, Oct 31 '22 at 10:29

Newbo.O · Accepted Answer · 2014-08-23T17:14:32.183

62

Actually, the documentation about escape sequences in PHP is wrong. The \xc2\xa0 syntax (used without the u flag) matches the two raw bytes C2 A0, which is how the character is stored in a UTF-8 string. The \x{c2a0} syntax with the u flag, however, is read as the Unicode code point U+C2A0, which is a different character with a different UTF-8 encoding.

A non-breaking space is U+00A0 in Unicode but is encoded as the bytes C2 A0 in UTF-8. So if you use the pattern ~\x{00a0}~siu, it will work as expected.
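The difference can be checked directly. The following is a minimal sketch, assuming PHP 7 or later (for the "\u{...}" string escape) and a PCRE build with UTF-8 support; the variable names are illustrative only.

```php
<?php
// U+00A0 (no-break space) is encoded in UTF-8 as the two bytes C2 A0.
$nbsp = "\u{00A0}";
// U+C2A0 is a different character (a Hangul syllable), encoded as EC 8A A0.
$other = "\u{C2A0}";

$nbspHex  = bin2hex($nbsp);   // "c2a0"
$otherHex = bin2hex($other);  // "ec8aa0"

$str = "a" . $nbsp . "b";

// Byte-oriented pattern, no u flag: matches the raw bytes C2 A0.
$bytes = preg_replace('~\xc2\xa0~', 'X', $str);   // "aXb"
// With the u flag, \x{...} names a code point; U+C2A0 does not occur in $str.
$wrong = preg_replace('~\x{C2A0}~u', 'W', $str);  // $str unchanged
// The code point of the no-break space is U+00A0.
$right = preg_replace('~\x{00A0}~u', 'W', $str);  // "aWb"
```

In short, the hex digits inside \x{...} are a code point number, not a dump of the UTF-8 bytes.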

edited Aug 23 '14 at 17:14

answered Oct 11 '12 at 11:10

Newbo.O


1

Hi Newbo. Your answer worked for me, but I still don't understand why. Is it because my nbsp is not UTF-8? My data is coming from a database table with utf8_general_ci character set, so it should be UTF-8 (my character_set_client and character_set_connection are also UTF-8). Do you have a link for more information on this? Thanks. – Buttle Butkus Jul 18 '13 at 22:53
3

[This article](http://rrn.dk/the-difference-between-utf-8-and-unicode) is great to understand more on this subject. There's also [this SO question](http://stackoverflow.com/questions/3951722/whats-the-difference-between-unicode-and-utf8) where the former article has been copy/pasted. – Newbo.O Jul 19 '13 at 09:25

score 14 · Answer 2 · edited May 10 '23 at 19:52

I've aggregated the previous answers so people can copy and paste the following code and choose their favorite method:

$some_text_with_non_breaking_spaces = "\xc2\xa0\xc2\xa0some text with 2 non breaking spaces at the beginning";
echo 'Qty non-breaking space : ' . substr_count($some_text_with_non_breaking_spaces, "\xc2\xa0") . '<br>';
echo $some_text_with_non_breaking_spaces . '<br>';

# Method 1 : regular expression
$clean_text = preg_replace('~\x{00a0}~siu', ' ', $some_text_with_non_breaking_spaces);

# Method 2 : convert to hex -> replace -> convert back to binary
$clean_text = hex2bin(str_replace('c2a0', '20', bin2hex($some_text_with_non_breaking_spaces)));

# Method 3 : my favorite
$clean_text = str_replace("\xc2\xa0", " ", $some_text_with_non_breaking_spaces);

echo 'Qty non-breaking space : ' . substr_count($clean_text, "\xc2\xa0"). '<br>';
echo $clean_text . '<br>';

The `hex2bin()` variant is dangerous: it will wrongly replace misaligned occurrences. Consider the hex sequence `0c2a0a`, for example. – jlh, May 06 '19 at 13:41
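The misalignment problem can be reproduced in a few lines. This is only an illustrative sketch; the input bytes are made up to match the hex sequence mentioned in the comment, and they contain no non-breaking space at all.

```php
<?php
// Three bytes: 0x0C, 0x2A ('*'), 0x0A. None of them belongs to a no-break space.
$input = "\x0c\x2a\x0a";

$hex = bin2hex($input);                      // "0c2a0a"
// 'c2a0' is found at hex offset 1, straddling the byte boundaries.
$falseHit = strpos($hex, 'c2a0');            // 1
// The replacement corrupts the data: the result has an odd number of hex digits.
$mangled = str_replace('c2a0', '20', $hex);  // "0200a"

// Working on the raw bytes only matches a genuine C2 A0 pair.
$safe = str_replace("\xc2\xa0", ' ', $input);
$unchanged = ($safe === $input);             // true
```

Because the hex string has two digits per byte, any search on it should only consider even offsets; replacing on the raw bytes avoids the issue altogether.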

score 3 · Answer 3 · edited Sep 10 '18 at 11:48


The two patterns do different things in my opinion: the first, \xc2\xa0, will replace TWO bytes, \xc2 followed by \xa0.

In UTF-8 encoding, this byte pair happens to be the encoding of the code point U+00A0.

Does \x{00A0} work? This should be the code point representation of \xc2\xa0.
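To see that the byte pair and the code point are the same character, one can compare the byte length with the number of code points. A small sketch, assuming PHP 7 or later; preg_match_all with the u flag is used here to count code points so that no extra extension is required.

```php
<?php
$nbsp = "\xc2\xa0";                                // the two bytes from the first pattern

$byteCount = strlen($nbsp);                        // 2 bytes
// In UTF-8 mode, "." consumes one code point at a time.
$codePointCount = preg_match_all('~.~su', $nbsp);  // 1 code point
// The same string written as a code point escape.
$sameChar = ($nbsp === "\u{00A0}");                // true
```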

edited Sep 10 '18 at 11:48

smottt


answered Oct 11 '12 at 11:12

DThought


score 1 · Answer 4 · answered Jul 24 '14 at 08:56

This variant did not work for me: ~\x{c2a0}~siu.

The variant \x{00A0} works. I have now tried a second option, and here is the result:

I converted the string to hex and replaced the no-break space 0xC2 0xA0 (c2a0) with a space 0x20 (20).

Code:

$hex = bin2hex($item);
$_item = str_replace('c2a0', '20', $hex);
$item = hex2bin($_item);

EllisGL · Answer 5 · 2018-04-02T02:35:06.783

0

/\x{00A0}/, /\xC2\xA0/ and the hex2bin/str_replace/bin2hex approach both worked and didn't work. When I printed the result to the screen it was all good, but when I tried to save it to a file, the file was blank!

I ended up using iconv('UTF-8', 'ISO-8859-1//IGNORE', $str);

edited Apr 02 '18 at 02:35

answered Apr 02 '18 at 02:28

EllisGL

