45

Assuming I have a string which is "a s d d" (the spaces are non-breaking spaces) and htmlentities turns it into
"a&nbsp;s&nbsp;d&nbsp;d".

How can I replace them (using preg_replace) without encoding the string to entities first?

I tried preg_replace('/[\xa0]/', '', $string); but it's not working. I'm trying to remove those special characters from my string because I don't need them.

What are the possibilities beyond regexp?

Edit: The string I want to parse: http://pastebin.com/raw/7eNT9sZr
using preg_replace('/[\r\n]+/', "[##]", $text)
followed later by implode("</p><p>", explode("[##]", $text))

My question is not exactly "how" to do this (since I could encode entities, remove the entities I don't need, and decode the rest), but how to remove those characters with just str_replace or preg_replace.

Grzegorz
  • `htmlentities` is prevention against xss. If you want to render in browser, the &nbsp will be evaluated as space only. If not then there is no use of the function – georoot Nov 21 '16 at 16:13
  • 2
    do you want to replace the spaces or the `&nbsp;`? – Joshua Nov 21 '16 at 16:14
  • @georoot htmlentities prevents bad HTML output (ie. it ensures that information is emitted, not data), XSS is just maliciously crafted bad data. – user2864740 Nov 21 '16 at 16:14
  • `$string` == `a&nbsp;s&nbsp;d&nbsp;d` or `a s d d`? – chris85 Nov 21 '16 at 16:14
  • `htmlentities("a s d d")` outputs `"a&nbsp;s&nbsp;d&nbsp;d"` – Grzegorz Nov 21 '16 at 16:16
  • @user2864740 exactly my point. You use `htmlentities` if you want to render in browser in which case &nbsp doesn't make any difference. If you don't want to render in browser there is no use of the function – georoot Nov 21 '16 at 16:16
  • @georoot The information in HTML of " " and "&nbsp;" is different. One is a space. One is a non-breaking space. Only a non-breaking space is encoded as "&nbsp;", not a normal space. – user2864740 Nov 21 '16 at 16:16
  • It's not for displaying, it's for storing in the database only. The only solution I can come up with atm is an htmlentities > str_replace > html_entity_decode chain – Grzegorz Nov 21 '16 at 16:17
  • @Grzegorz Use SQL parameterized queries for "storing to the database". In any case the input data *already contains* a not-a-normal-space. – user2864740 Nov 21 '16 at 16:17
  • @Grzegorz what is the point of using perl expressions? Why not to use str_replace in this case? – Victor Rudkov Nov 21 '16 at 16:18
  • http://pastebin.com/raw/7eNT9sZr Here is string I want to make into html. Replace multiple \r\n (which are divided by \xa0) and make pretty html. – Grzegorz Nov 21 '16 at 16:20
  • 2
    I think he is looking for a way to remove the non-breaking spaces from the string WITHOUT turning them into HTML entities first. – simon Nov 21 '16 at 16:22
  • In what encoding is your string? Is it UTF-8? If yes, I would say that non-breakable space is `0xc2a0` there. – David Ferenczy Rogožan Nov 21 '16 at 16:22
  • It's utf8. Also tried `\xc2a0` nothing is working and I'm wondering **WHY**. I want to know how it works, not how to do this :) – Grzegorz Nov 21 '16 at 16:24

3 Answers

92

Problem Explanation

The reason it's not working is that you are specifying the non-breaking space incorrectly.

In UTF-8, the non-breaking space (U+00A0) is encoded as 0xC2A0. It consists of two bytes, 0xC2 (194) and 0xA0 (160), so technically you're specifying only half of the character's code (its second byte).
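To see the two bytes for yourself, here is a minimal sketch, assuming PHP 7 or later (needed for the "\u{...}" string escape). The variable name $nbsp is only illustrative.

```php
<?php
// U+00A0 (NO-BREAK SPACE), written with the PHP 7+ Unicode escape syntax.
$nbsp = "\u{00A0}";

echo bin2hex($nbsp), "\n";        // hex dump of the UTF-8 bytes
echo strlen($nbsp), "\n";         // strlen() counts bytes, not characters

var_dump($nbsp === "\xc2\xa0");   // the full two-byte sequence
var_dump($nbsp === "\xa0");       // only the second byte
```

The hex dump is c2a0 and the length is 2, which is why a pattern that mentions only \xa0 does not describe the whole character.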

A Bit of Theory

Legacy character encodings used a fixed number of bits to encode every character in their set. For example, the original ASCII encoding used 7 bits per character, and the extended ASCII encodings used 8 bits.

UTF-8 is a so-called variable-width character encoding, which means that the number of bits used to represent individual characters varies: in UTF-8, character codes consist of one to four 8-bit bytes (octets). Loosely similar to Huffman coding, the characters assumed to be most common (ASCII first, then the lower code points) have shorter codes, while rarer characters have longer ones. That helps reduce the data size of typical text.
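The variable width can be checked directly. This is a small sketch, assuming PHP 7+ for the "\u{...}" escapes; the sample characters and array keys are arbitrary picks for illustration.

```php
<?php
// One sample character for each possible UTF-8 length.
$samples = [
    'latin a' => "a",          // U+0061
    'nbsp'    => "\u{00A0}",   // U+00A0
    'euro'    => "\u{20AC}",   // U+20AC
    'emoji'   => "\u{1F600}",  // U+1F600
];

// strlen() returns the number of bytes in each string.
$lengths = array_map('strlen', $samples);

foreach ($lengths as $name => $bytes) {
    echo $name, ': ', $bytes, " byte(s)\n";
}
```

The four samples take 1, 2, 3 and 4 bytes respectively.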

Solution

You can replace all occurrences of the UTF-8 non-breaking space in text using a simple (and fast) str_replace or using a more flexible regular expression, depending on your needs:

// faster solution
$regular_spaces = str_replace("\xc2\xa0", ' ', $original_string);

// more flexible solution
$regular_spaces = preg_replace('/\xc2\xa0/', ' ', $original_string);
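If the non-breaking space has to sit inside a character class together with other characters, a byte-wise class such as [\xc2\xa0] would match either byte on its own. One possible way around that, sketched here under the assumption of PHP 7+ and valid UTF-8 input, is to add the u modifier so the pattern works on code points and to write the character as \x{A0}. The sample string and variable names are made up for illustration.

```php
<?php
// Sample input: a non-breaking space, a control character (0x0F) and a
// copyright sign (U+00A9, bytes C2 A9), which shares its first byte with
// the non-breaking space.
$input = "a\u{00A0}b\x0F\u{00A9}";

// With /u the class members are code points, so \x{A0} means the whole
// character U+00A0 and the copyright sign is not touched.
$clean = preg_replace('/[\x{0E}-\x{1F}\x{A0}]/u', '', $input);

echo bin2hex($clean), "\n";
```

Here the non-breaking space and the control character are removed while the copyright sign keeps both of its bytes.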

Notes

Note that with str_replace you have to enclose the search string in double quotes (") because str_replace doesn't understand the textual representation of character codes; it needs those codes to be converted into actual characters first. PHP does that automatically: in double-quoted strings, escape sequences (e.g. the newline \n, hexadecimal character codes such as \xc2, etc.) are replaced by the actual characters (e.g. the byte 0x0A for \n) before the string value is used.

In contrast, the preg_replace function itself understands the textual representation of character codes, so you don't need PHP to convert them into actual characters and you can use single quotes (') to enclose the search string in this case.
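The quoting difference can be tried out with a short sketch like the following. The input string and variable names are assumptions made for the demonstration.

```php
<?php
// "a", non-breaking space, "s", non-breaking space, "d" as raw UTF-8 bytes.
$original = "a\xc2\xa0s\xc2\xa0d";

// Double quotes: PHP turns \xc2\xa0 into two real bytes before
// str_replace ever sees the search string.
$doubleQuoted = str_replace("\xc2\xa0", ' ', $original);

// Single quotes: the search string is the literal text \xc2\xa0
// (eight characters), which does not occur in the input.
$singleQuoted = str_replace('\xc2\xa0', ' ', $original);

// preg_replace: the regex engine interprets \xc2\xa0 itself, so single
// quotes are fine here.
$viaRegex = preg_replace('/\xc2\xa0/', ' ', $original);

var_dump($doubleQuoted, $singleQuoted === $original, $viaRegex);
```

Only the single-quoted str_replace call leaves the string unchanged; the other two produce "a s d" with regular spaces.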

David Ferenczy Rogožan
  • 2
    Note that `str_replace()` will work as well and is much faster. – simon Nov 21 '16 at 16:35
  • 1
    @simon Thank you, you're right. Added to my answer. – David Ferenczy Rogožan Nov 21 '16 at 16:42
  • 1
    I had no idea I have to write `\xc2\xa0` and wrote `\xc2a0`... my fail. Thank you! – Grzegorz Nov 21 '16 at 16:44
  • 1
    Maybe could you tell me how to replace it in group? `preg_replace('/[\x0E-\x1f]/', '', $string);`? – Grzegorz Nov 21 '16 at 16:45
  • 1
    @Grzegorz I'm not sure what you mean by that. Do you mean how to say that the codes in square brackets (`[\xc2\xa0]`) are a single character and not two? – David Ferenczy Rogožan Nov 21 '16 at 17:17
  • Encodings are not my strong point (utf8). For example I have `preg_replace('/[\x0E-\x1F\xc2\xa0]/', '', $string);` it would replace either `\xc2` and `\xa0` how to include it in regex so it only replaces `\xc2\xa0` and leaves `\xc2` intact? – Grzegorz Nov 22 '16 at 06:31
  • @DawidFerenczy Any idea? :P – Grzegorz Nov 22 '16 at 11:52
  • Sorry, I'm not sure about that. Did you solve it already? If not, you can probably create a new question to address that. – David Ferenczy Rogožan Mar 14 '17 at 13:03
  • @DavidFerenczyRogožan Is it possible to trim both these `\xc2\xa0` and space with the same `trim` function? like `trim($str, "\xC2\xA0")` or is there any way to do like that? – Saroj Shrestha Nov 07 '20 at 13:31
  • This was very helpful. However, in an ASCII file it's just \xA0. In Excel I was able to locate this character by searching for \xC2\xA0 but when I saved the file to CSV I found the character by searching for \xA0. – Jimbo Feb 10 '23 at 01:59
23

Sanitize every type of whitespace.

preg_replace("/\s+/u", " ", $str);
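A quick sketch of what this does, assuming PHP 7+ and a PCRE build with Unicode support (the usual case, where the u modifier also makes \s Unicode-aware). The sample string is invented. Because \s also covers line breaks, the sketch adds a variant with \h (horizontal whitespace only) for cases where newlines should survive; that variant is a suggestion, not part of the original one-liner.

```php
<?php
// Two non-breaking spaces, then space + tab + space, then a newline.
$str = "a\u{00A0}\u{00A0}s \t d\nd";

// Every run of whitespace, including the newline, becomes one space.
$collapsed = preg_replace('/\s+/u', ' ', $str);

// Only horizontal whitespace is collapsed; the newline is kept.
$keepNewlines = preg_replace('/\h+/u', ' ', $str);

var_dump($collapsed, $keepNewlines);
```

Dropping the + from either pattern replaces each whitespace character individually instead of collapsing runs.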

https://stackoverflow.com/a/40264711/635364

FYI, the sanitization filters of PHP's filter_var() do not include one for these whitespace characters.

Jehong Ahn
  • 4
    This is definitely the best option and should be the selected answer. – Moritz Friedrich Feb 06 '22 at 16:56
  • 2
    The only answer that worked for me! – user3382203 Apr 08 '22 at 08:29
  • 1
    This solution is also the most flexible – I had to `trim()` a string and non breakable spaces were not removed, so rather than replacing and then trimming I did a simple `preg_replace("/^\s+|\s+$/u", "", $str)`. – Francesco Marchetti-Stasi May 17 '23 at 08:56
  • AFAICS, this will replace all consequent whitespace into one. You may need to remove the `+` after `\s` to keep the number of spaces the same. Also, I'm not sure about this, but this may remove the line breaks too. – Taha Paksu Aug 29 '23 at 10:16
0

Select the right charset for your string:

$yourCharset='UTF-8'; // or 'ISO8859-1', or...

Then use the return value of html_entity_decode as the search string:

$string=str_replace(html_entity_decode('&nbsp;',ENT_COMPAT,$yourCharset),' ',$string);
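To see why the charset matters, here is a small sketch that dumps what html_entity_decode returns for two charsets. The variable names are illustrative only.

```php
<?php
// Decode the same entity into two different target encodings.
$utf8Nbsp   = html_entity_decode('&nbsp;', ENT_COMPAT, 'UTF-8');
$latin1Nbsp = html_entity_decode('&nbsp;', ENT_COMPAT, 'ISO-8859-1');

echo bin2hex($utf8Nbsp), "\n";    // two bytes in UTF-8
echo bin2hex($latin1Nbsp), "\n";  // one byte in ISO-8859-1

// Replacing in a UTF-8 string with the matching search value.
$string = "a\xc2\xa0s";
$string = str_replace($utf8Nbsp, ' ', $string);
var_dump($string);
```

The UTF-8 result is the two-byte sequence c2a0, while the ISO-8859-1 result is the single byte a0, so the search value has to match the encoding of the string being cleaned.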
FrancescoMM