4

I'm having a problem where PHP (5.2) cannot find the character 'Â' in a string, though it is clearly there.

I realize the underlying problem has to do with character encoding, but unfortunately I have no control over the source content. I receive it as UTF-8, with those characters already in the string.

I would simply like to remove it from the string, but strpos(), str_replace(), preg_replace(), trim(), etc. cannot correctly identify it.

My string is this:

"Â  Â  Â  A lot of couples throughout the World "

If I do this:

$string = str_replace('Â','',$string);

I get this:

"� � � A lot of couples throughout the World"

I even tried utf8_encode() and utf8_decode() before the str_replace, with no luck.

What's the solution? I've been throwing everything I can find at it...

Travis

3 Answers

4
$string = str_replace('Â','',$string);

How is this 'Â' encoded? If your script file is saved as ISO-8859-1, the string literal 'Â' is encoded as the single byte xC2, while its UTF-8 representation is the two-byte sequence xC3 x82. PHP's str_replace() works at the byte level, i.e. it only "knows" single-byte characters, so the bytes of the literal in your script have to match the bytes in $string exactly.

see http://docs.php.net/intro.mbstring
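
To make the byte-level point concrete, here is a minimal sketch. It assumes the input really contains a UTF-8 encoded 'Â'; the sample text and variable names are made up for illustration. The byte escapes keep the result independent of how the script file itself is saved.

```php
<?php
// 'Â' (U+00C2) written as explicit bytes, so the encoding of this
// script file does not matter.
$utf8A   = "\xC3\x82"; // UTF-8: two bytes
$latin1A = "\xC2";     // ISO-8859-1: one byte

// Assumed input: UTF-8 text that starts with 'Â'.
$string = $utf8A . " A lot of couples";

// The ISO-8859-1 byte 0xC2 does not occur in the string, so nothing is found.
$miss = strpos($string, $latin1A);

// The UTF-8 byte sequence is found at offset 0 and can be removed.
$hit     = strpos($string, $utf8A);
$cleaned = str_replace($utf8A, '', $string);

echo bin2hex($string), "\n"; // raw bytes of the input, in hex
var_dump($miss, $hit, $cleaned);
```

If the search still fails with the UTF-8 sequence, dumping the input with bin2hex() shows which bytes are actually there.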

VolkerK
  • +1, you can therefore write the replace as: `str_replace(chr(195) . chr(130), '', $string)`... (where `195` and `130` are `xC3` and `x82` converted from hex to decimal, respectively)... Or, since PHP supports hex numbers: `str_replace(chr(0xC3) . chr(0x82), '', $string)`... – ircmaxell Aug 27 '10 at 19:39
  • I also found that mb_ereg_replace() didn't seem to work properly; Isn't this its purpose? Your information is extremely useful and I'll be sure to read the documentation you linked. Thanks! – Travis Aug 27 '10 at 20:10
  • @Travis: The parameters you pass to the mbstring functions have to be encoded properly as well. If you have a string literal in your script (like 'Â') then the encoding depends on how you've saved the script file. – VolkerK Aug 27 '10 at 23:37
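
One possible explanation for the "� � �" output in the question, offered as a guess and not confirmed anywhere in this thread: the input may contain UTF-8 no-break spaces (bytes xC2 xA0), which display as 'Â' followed by a space when read as ISO-8859-1. A rough sketch of that scenario, with a made-up sample string:

```php
<?php
// ASSUMPTION: the input holds UTF-8 no-break spaces (0xC2 0xA0).
$nbsp   = "\xC2\xA0";
$string = $nbsp . $nbsp . "A lot of couples";

// Dump the raw bytes to see what is really in the string.
echo bin2hex(substr($string, 0, 4)), "\n"; // c2a0c2a0

// Removing only the 0xC2 byte (what an ISO-8859-1 'Â' literal would match)
// leaves stray 0xA0 bytes, which are not valid UTF-8 on their own.
$broken = str_replace("\xC2", '', $string);

// Removing the whole two-byte sequence leaves clean text.
$fixed = str_replace($nbsp, '', $string);
```

Running bin2hex() on the real input is the quickest way to check whether this guess applies.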
3

I use this:

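function replaceSpecial($str){
    // Split into single bytes (not characters)
    $chunked = str_split($str, 1);
    $str = "";
    foreach ($chunked as $chunk) {
        $num = ord($chunk);
        // Keep only bytes 32..123 (space through '{'); this drops control
        // characters, '|', '}', '~' and every byte of a multi-byte UTF-8 character
        if ($num >= 32 && $num <= 123) {
            $str .= $chunk;
        }
    }
    return $str;
}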
KeatsKelleher
  • You can expand this to allow all ascii characters by changing 32 to 0 and 123 to 255. – KeatsKelleher Aug 27 '10 at 19:16
  • This will remove MANY more characters than just accents. – shamittomar Aug 27 '10 at 19:17
  • First off, the only ASCII overlap is between 0 and 127. If you allow character 128 or higher, you'll break the encoding (this is due to the multi-byte nature of UTF-8). However, this is quite a dirty way of doing it. What I would do if I were you is simply use the [`iconv`](http://us3.php.net/manual/en/book.iconv.php) function if you need to convert to ASCII... `$str = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $string)`, especially since it'll transliterate characters for you... – ircmaxell Aug 27 '10 at 19:49
  • Ahh.. I think I understand the solution, but I'm still not clear why PHP doesn't recognize the characters? I think I'll use something like this, but only strip a few specific chars. Thanks! – Travis Aug 27 '10 at 20:14
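
For stripping only a few specific characters, as the last comment suggests, something along these lines might do. This is a sketch: `stripSequences` is a hypothetical helper name and the sample input is made up. The iconv() call is the one from the comment above; its transliteration result depends on the platform's iconv implementation, so no particular output is claimed for it.

```php
<?php
// Hypothetical helper: remove only the listed UTF-8 byte sequences.
function stripSequences($str, array $sequences) {
    return str_replace($sequences, '', $str);
}

$input = "\xC3\x82 caf\xC3\xA9"; // "Â café" encoded as UTF-8

// Targeted removal: only 'Â' (0xC3 0x82) is dropped, 'é' survives.
$targeted = stripSequences($input, array("\xC3\x82"));

// Blunt alternative: drop every byte outside 7-bit ASCII (0..127).
$asciiOnly = preg_replace('/[^\x00-\x7F]/', '', $input);

// iconv route; may return false or differ between systems.
$translit = null;
if (function_exists('iconv')) {
    $translit = @iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $input);
}
```

The targeted version keeps the rest of the UTF-8 text intact, while the ASCII-only filter also removes accented letters you may want to keep.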
1

From the PHP Manual Comment Page:

http://www.php.net/manual/en/function.preg-replace.php#96847

And from StackOverflow:

Remove accents without using iconv

Community
shamittomar