
I'd like to match some specific UTF-8 characters, in my case German umlauts. Here's our example code:

{UTF-8 file}
<?php
$search = 'ä,ö,ü';
$replace = 'ae,oe,ue';
$string = str_replace(explode(',', $search), explode(',', $replace), $string);
?>

This file is saved as UTF-8. Now I'd like to ensure that the code works regardless of which of the (most commonly) used character sets the source file is saved in.

Is this the way to go (using a UTF-8 check)?

{ISO file}
<?php
$search = 'ä,ö,ü';
$search = preg_match('~~u', $search) ? $search : utf8_encode($search);
$replace = 'ae,oe,ue';
$string = str_replace(explode(',', $search), explode(',', $replace), $string);
?>
mgutt

1 Answer

  1. You should be in control of what your source code is encoded as; it'd be very weird to have its encoding suddenly change out from under you.
  2. If that is actually a legitimate concern you want to counteract, then you can't even rely on your source code being either Latin-1 or UTF-8. It could be any number of other encodings (though admittedly, in practice, Latin-1 is a pretty common guess). So `utf8_encode`, which only converts from ISO-8859-1 (Latin-1) to UTF-8, is not guaranteed to fix your problem at all.
  3. To be 100% agnostic of your source code file's encoding, denote your characters as raw bytes:

    $search = "\xC3\xA4,\xC3\xB6,\xC3\xBC"; // ä, ö and ü in UTF-8
    
  4. Note that this still won't guarantee what encoding `$string` will be in; you'll need to know and/or control its encoding separately from the issue at hand. At some point you just have to nail down the encodings you use. You can't be agnostic of them all the way through.
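
To illustrate points 3 and 4 together, here is a minimal sketch of the replacement with every non-ASCII character written as escaped bytes. The function name `transliterate_umlauts` is made up for this example, and the sketch assumes the input string is already UTF-8, which the code itself cannot guarantee:

```php
<?php
// Search terms as escaped UTF-8 bytes, so the encoding this file
// is saved in does not affect them.
$search  = array("\xC3\xA4", "\xC3\xB6", "\xC3\xBC"); // ä, ö, ü in UTF-8
$replace = array('ae', 'oe', 'ue');

// Hypothetical helper; assumes $string is UTF-8 encoded.
function transliterate_umlauts($string, array $search, array $replace)
{
    // str_replace works on bytes, which is safe here because the
    // search terms are complete UTF-8 sequences.
    return str_replace($search, $replace, $string);
}

// "Müller fährt nach Köln", also written as escaped UTF-8 bytes.
$input = "M\xC3\xBCller f\xC3\xA4hrt nach K\xC3\xB6ln";
echo transliterate_umlauts($input, $search, $replace), "\n";
// prints: Mueller faehrt nach Koeln
```

If the input arrives as Latin-1 instead, none of the search terms match and the string comes back unchanged, which is why the encoding of the input still has to be pinned down separately.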

deceze
  • Sometimes I cannot control my code, as it is used by third parties. Covering Latin-1 and similar encodings would reduce the support effort. – mgutt Apr 13 '15 at 19:08
  • You mean to say you're sending somebody a code file and they convert it to another encoding? D-: – deceze Apr 13 '15 at 19:11
  • Yes. Thank you for your help. An additional question: Is `bin2hex()` what I need to convert my chars to raw bytes? – mgutt Apr 15 '15 at 15:53
  • If you have a UTF-8 string and you want to see its hex representation as above, then yes: `bin2hex('äöü')` and insert `\x` every two characters. – deceze Apr 15 '15 at 15:56
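
The `bin2hex()` step from the last comment could be automated roughly like this. The helper name `to_hex_escapes` is hypothetical, and the uppercase hex digits are only a cosmetic choice to match the answer above (PHP accepts either case in `\x` escapes):

```php
<?php
// Hypothetical helper: turn a byte string into the text of a PHP
// "\x.." escape sequence that can be pasted into a double-quoted string.
function to_hex_escapes($bytes)
{
    if ($bytes === '') {
        return '';
    }
    // bin2hex() yields two hex digits per byte; split into pairs
    // and prefix each pair with a literal backslash-x.
    $pairs = str_split(strtoupper(bin2hex($bytes)), 2);
    return '\\x' . implode('\\x', $pairs);
}

// The input is given as escaped bytes here so the example does not
// depend on this file's encoding; with a UTF-8 file, 'äöü' is equivalent.
echo to_hex_escapes("\xC3\xA4\xC3\xB6\xC3\xBC"), "\n";
// prints: \xC3\xA4\xC3\xB6\xC3\xBC
```

This has to be run once from a file that really is UTF-8 (or on input known to be UTF-8); otherwise the printed bytes describe whatever encoding the characters were actually in.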