1

Hello, I have the following problems:

First: if ÄÖÜ is included in the regex character class, replacing the trademark sign produces an additional char �.

Second: if I loop over the resulting string character by character, all special chars come out as �.

Question: why does this happen, and what can I do about it? (The second question is not so necessary, but interesting.)

header('Content-Type: text/html; charset=utf-8');
$testtxt = 'MicrÖsüft W!ndows® is a trÄdemark of Microfrost™ ©2012!';

$r =  preg_replace('#[^\w\s\däöüß%\!\?\.,\:\-_\[\]ÄÖÜ]#is', 'X', $testtxt);
echo $testtxt, '<br>', $r;
echo '<hr>';
for($i = 0, $size = strlen($r); $i < $size; ++$i) {
    echo $r[$i], '=', ord($r[$i]), '<br>';
}

Result:

MicrÖsüft W!ndows® is a trÄdemark of Microfrost™ ©2012!
MicrÖsüft W!ndowsXX is a trÄdemark of MicrofrostX�X XX2012!
M=77
i=105
c=99
r=114
�=195
�=150....

Expected:

MicrÖsüft W!ndows® is a trÄdemark of Microfrost™ ©2012!
MicrÖsüft W!ndowsXX is a trÄdemark of MicrofrostXX XX2012!
M=77
i=105
c=99
r=114
Ö=195
s=150....
Mr. Foo

2 Answers

2

You're using the strlen and ord functions, which work on bytes and are therefore not compatible with multibyte characters. The following code should show you the number of bytes per character:

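for ($i = 0, $size = mb_strlen($r, 'UTF-8'); $i < $size; ++$i) {
    // $r[$i] would return a single byte, so fetch the whole character instead
    $char = mb_substr($r, $i, 1, 'UTF-8');
    echo $char, '=', strlen($char), '<br>';
}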
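
As a rough illustration of the byte versus character distinction, the sketch below compares strlen and mb_strlen on part of the sample string. It assumes the mbstring extension is loaded and that the source file is saved as UTF-8; the variable names are only for illustration.

```php
<?php
// Part of the sample string from the question; file assumed to be saved as UTF-8.
$s = 'MicrÖsüft';

// strlen() counts bytes: Ö and ü take two bytes each in UTF-8.
$bytes = strlen($s);              // 11

// mb_strlen() counts characters when told the encoding.
$chars = mb_strlen($s, 'UTF-8');  // 9

echo $bytes, ' bytes, ', $chars, ' characters', PHP_EOL;
```

A loop bounded by strlen therefore runs once per byte, which is why the original loop printed two broken entries for each umlaut.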

Second, you should add the UTF-8 modifier (u) to your regexp, so that both the pattern and the subject are treated as UTF-8 instead of as single bytes:

$r =  preg_replace('#[^\w\s\däöüß%\!\?\.,\:\-_\[\]ÄÖÜ]#isu', 'X', $testtxt);
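
To see where the stray � most likely comes from, here is a minimal sketch that dumps the bytes involved. It assumes the file is saved as UTF-8 and the default "C" locale; the shortened sample string and the variable names are made up for the demonstration. Without the u modifier the character class is a set of bytes, and the second byte of Ä (0x84) is also the middle byte of ™, so that one byte is kept while its neighbours are replaced.

```php
<?php
// Assumes this file is saved as UTF-8, as in the question.
$testtxt = 'Microfrost™';
$class   = 'äöüßÄÖÜ';

echo bin2hex('Ä'), PHP_EOL;  // c384
echo bin2hex('™'), PHP_EOL;  // e284a2, the middle byte 0x84 also occurs in Ä

// Byte mode: the class contains the bytes c3, a4, b6, bc, 9f, 84, 96, 9c.
$byteMode = preg_replace('#[^\w' . $class . ']#', 'X', $testtxt);

// UTF-8 mode: the class contains seven characters, and ™ is not one of them.
$utf8Mode = preg_replace('#[^\w' . $class . ']#u', 'X', $testtxt);

echo bin2hex(substr($byteMode, -3)), PHP_EOL; // 588458: X, stray 0x84, X
echo $utf8Mode, PHP_EOL;                      // MicrofrostX
```

Note that with the u modifier each special character becomes a single X, not XX as in the expected output of the question, because the replacement happens per character and not per byte.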
Yosh
0

I'm not really sure, but you could try adding the "u" modifier to the regex in preg_replace(). Or try using mb_eregi_replace().

Also, are you sure it's okay to use ord()? It is written in the manual that it returns the ASCII value (which, I strongly suspect, means that for multibyte characters it'll return the value of the first byte).
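
A small sketch to check that suspicion, assuming a UTF-8 source file and the mbstring extension. The route through UCS-4BE is just one possible way to get the code point and is not taken from the question.

```php
<?php
// Assumes this file is saved as UTF-8.
$char = 'Ö';

// ord() only looks at the first byte of its argument.
$first = ord($char);                          // 195 (0xC3), the lead byte of Ö

// All bytes of the character, for comparison.
$bytes = array_map('ord', str_split($char));  // [195, 150]

// One way to get the Unicode code point: convert to UCS-4 (big endian)
// and unpack the four bytes as an unsigned 32-bit integer.
$ucs4      = mb_convert_encoding($char, 'UCS-4BE', 'UTF-8');
$codePoint = unpack('N', $ucs4)[1];           // 214 (U+00D6)

echo $first, ' / ', implode(',', $bytes), ' / ', $codePoint, PHP_EOL;
```

The values 195 and 150 are the same two numbers the loop in the question printed for Ö, one per byte.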

Exander
  • Hmm, what really makes me wonder is that you cannot access a char via $string[$position]. I thought that this was a built-in language feature. After reconsidering, I think the only way to loop through a UTF-8 string is via mb_substr. – Mr. Foo Jun 25 '12 at 18:31
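
On the point raised in the comment: $string[$position] is a built-in feature, but it addresses bytes, not characters. Besides mb_substr, another option is to split the string into characters once and loop over the resulting array. The sketch below uses preg_split with an empty pattern and the u modifier; it assumes a UTF-8 source file and valid UTF-8 input, and the sample word is shortened from the question.

```php
<?php
// Assumes this file is saved as UTF-8.
$r = 'trÄdemark';

// With the u modifier the empty pattern matches between code points,
// not between bytes, so every array element is one whole character.
$chars = preg_split('//u', $r, -1, PREG_SPLIT_NO_EMPTY);

foreach ($chars as $char) {
    echo $char, '=', strlen($char), ' byte(s)', PHP_EOL;
}
```

Splitting once avoids calling mb_substr for every position, which may matter for long strings since each call probably has to scan from the start to find the character offset.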