6

I believe I'm receiving data that mixes Unicode spaces and ASCII spaces, so strings that look identical are not equal, for example, "water resistant" != "water resistant". In my database, however, the strings do look different, showing the odd characters you normally see when a multibyte character is present: "water resistantÂ" and " water resistant".

I would like a way to convert all spaces to ASCII spaces or, if that is easier, to convert them all to multibyte spaces.

I've tried using preg_replace, but then the strings no longer read as valid multibyte strings. (Multibyte characters in the strings appear as garbage.)

preg_replace('/[\pZ\pC]/',' ',$field);

I've also tried using mb_ereg_replace, but it had no effect.

mb_ereg_replace('/[\pZ\pC]/',' ',$field)
Kai

4 Answers

9

You can find the UTF-8 non-breaking spaces and replace them with standard ASCII spaces via:

$string = str_replace("\xc2\xa0", "\x20", $string);
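As a minimal sketch of what that call does: it assumes the string is UTF-8 encoded, where the non-breaking space U+00A0 is stored as the two bytes 0xC2 0xA0. The sample value below is hypothetical. If the data is in a single-byte encoding such as ISO-8859-1, the non-breaking space is the lone byte 0xA0 and this replacement would not match, which may be why it does not work for everyone.

```php
<?php
// Hypothetical sample: "water", a UTF-8 non-breaking space (U+00A0), "resistant".
$string = "water\xC2\xA0resistant";

// Swap the byte pair 0xC2 0xA0 (UTF-8 for U+00A0) for an ASCII space (0x20).
$clean = str_replace("\xC2\xA0", "\x20", $string);

var_dump($clean === "water resistant"); // bool(true)
```

Because str_replace works on raw bytes, it leaves every other multibyte character in the string untouched.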
Rob Evans
  • i consider this solution the best -- simple and functional. worked for me in my WordPress content_save_pre() filter to kill non-breaking spaces where a user typed two consecutive spaces in whatever content editor they were using (like Word) -- which converts one of the spaces to a non-breaking space to preserve 2-spaces. since we're not using typewriters, it's absurd to 2-space -- besides, it's type flow hell in a browser – aequalsb Mar 30 '15 at 17:55
  • This did not work for me, but @Kai's answer did work. – Sithu Aug 11 '15 at 09:13
5

It looks like preg_replace('/[\pZ\pC]/u',' ',$field); works. I had forgotten the u modifier at the end of the regex.
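A short sketch of the difference the u modifier makes, using a hypothetical sample value. With u, PCRE treats both the pattern and the subject as UTF-8, so \pZ and \pC match whole code points instead of individual bytes. The sketch uses "_" as the replacement only to make the matches visible, and adds a narrower pattern (an assumption, not part of the original answer) that targets U+00A0 alone.

```php
<?php
// Hypothetical sample: "café", a non-breaking space (U+00A0), then "au lait".
$field = "caf\xC3\xA9\xC2\xA0au lait";

// Broad pattern: every separator (\pZ) or "other" character (\pC),
// which includes the ordinary ASCII space as well as U+00A0.
$broad = preg_replace('/[\pZ\pC]/u', '_', $field);

// Narrow pattern: only the non-breaking space U+00A0.
$narrow = preg_replace('/\x{00A0}/u', '_', $field);

var_dump($broad === "caf\xC3\xA9_au_lait");  // bool(true)
var_dump($narrow === "caf\xC3\xA9_au lait"); // bool(true)
```

In both cases the multibyte "é" survives intact, which is the behaviour the version without u was missing.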

Kai
  • 1
    This works for me, but note that it seems to be a little more aggressive than may be desired. The regex provided also matches a "standard" ASCII space. So, if you're trying to replace *only* Unicode non-breaking spaces (e.g. with a non-space character), this will replace more characters than you intend. – rinogo Nov 17 '16 at 15:37
2

I think you're looking for utf8_decode($field).

Joren
  • Agreed - That works for me with ISO-8859-1 to UTF-8 for my database. – Mat Carlson Nov 20 '13 at 19:55
  • If I call utf8_decode($field), the field will still appear with garbage characters when displayed on the webpage. I also need to fix the spaces problem before I save to the database, because otherwise it will store duplicates of "water resistant" with various white spaces, rather than just a single entry "water resistant". – Kai Nov 20 '13 at 20:12
0

Those spaces that you call Unicode spaces are non-breaking spaces (which is what &nbsp; stands for).

When saving the data, you have to clean it first. Replace all non-breaking spaces with normal spaces, replace double spaces with single spaces, and finally trim the string.
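The three cleaning steps could be sketched roughly as follows. The helper name normalizeSpaces is hypothetical, and the sketch assumes the input is valid UTF-8, so that a non-breaking space is the byte pair 0xC2 0xA0.

```php
<?php
// normalizeSpaces() is a hypothetical helper; it assumes $value is valid UTF-8.
function normalizeSpaces(string $value): string
{
    // 1. Non-breaking spaces (U+00A0) become ordinary ASCII spaces.
    $value = str_replace("\xC2\xA0", ' ', $value);
    // 2. Runs of two or more spaces collapse to a single space.
    $value = preg_replace('/ {2,}/', ' ', $value);
    // 3. Leading and trailing whitespace is removed.
    return trim($value);
}

var_dump(normalizeSpaces("water\xC2\xA0 resistant\xC2\xA0") === "water resistant"); // bool(true)
```

Running every value through a helper like this before saving means differently spaced variants of "water resistant" end up as the same stored string.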

Lorenz Meyer