I need to replace special characters inside a string with other characters. For example, "ä" can be replaced by either "a" or "ae", and "à" by "a". Normally this is easy to do in PHP, and there are lots of functions on Stack Overflow that already do exactly that.

Unfortunately, my string looks like this: "u\u0308 a\u0302 a\u0308 o\u0300.zip" (ü â ä ò.zip). As you can see, my strings are file names, and OS X seems to convert the characters to Unicode (at least that is what I think).

I know that I could use a very long array with all the special characters to replace them in PHP:

$str = "u\u0308 a\u0302 a\u0308 o\u0300.zip";

$ch = array("u\u0308", "a\u0302", "a\u0308", "o\u0300");
$chReplace = array("u", "a", "a", "o");

// str_replace() returns the new string, so the result has to be assigned
$str = str_replace($ch, $chReplace, $str);

But I'm wondering if there is an easier way, so I don't have to do this manually for every character?

Amal Murali
mrksbnch
  • [mb_convert_encoding()](http://www.php.net/manual/en/function.mb-convert-encoding.php) or [iconv()](http://www.php.net/manual/en/function.iconv.php) with `//TRANSLIT` – Mark Baker Mar 31 '14 at 07:56
  • `utf8_encode($data)` might also work. – Pim Verlangen Mar 31 '14 at 07:57
  • @PimVerlangen `utf8_encode` outputs the same string, so "u\u0308 a\u0302 a\u0308 o\u0300.zip" – mrksbnch Mar 31 '14 at 08:01
  • @MarkBaker What do I have to use as charset/encoding type? – mrksbnch Mar 31 '14 at 08:02
  • If you're trying to convert UTF-8 to whatever charset is used by your filesystems, then those are the charsets that you specify for mb_convert_encoding() or iconv() – Mark Baker Mar 31 '14 at 08:03
  • @MarkBaker I'm trying to convert the string above (which looks like unicode to me?) to "normal" characters, so in my case "ü â ä ò". So I think I'm looking for `mb_convert_encoding($str, WHAT_ENCODING_HERE, "UTF-8");` – mrksbnch Mar 31 '14 at 08:08
  • What charset is it likely to be? I can only guess! "ISO-8859-1" perhaps – Mark Baker Mar 31 '14 at 08:12
  • @MarkBaker Thanks, I need to figure that out, I don't know much about charsets – mrksbnch Mar 31 '14 at 08:43

1 Answer

You can solve this problem by dividing it into multiple steps:

  • Convert the `\uXXXX` escape sequences (Unicode code points) to HTML entities. This can be easily achieved using preg_replace(). For an explanation of how the regex works, see my answer here.

  • Now you will have a string containing HTML entities: for example, `u\u0308` has become `u&#x0308;`. To convert the entities into their corresponding characters, use html_entity_decode().

  • You will now have a UTF-8 string. You need to convert it into ISO-8859-1 (official ISO 8-bit Latin-1) with iconv(). The //TRANSLIT suffix enables transliteration: when a character can't be represented in the target charset, iconv() tries to approximate it with one or more similar-looking characters.

Code:

// Set the locale to something that's UTF-8 capable
setlocale(LC_ALL, 'en_US.UTF-8');

$str = "u\u0308 a\u0302 a\u0308 o\u0300";

// Convert the codepoints to entities
$str = preg_replace("/\\\\u([0-9a-fA-F]{4})/", "&#x\\1;", $str);

// Convert the entities to a UTF-8 string
$str = html_entity_decode($str, ENT_QUOTES, 'UTF-8');

// Convert the UTF-8 string to an ISO-8859-1 string
echo iconv("UTF-8", "ISO-8859-1//TRANSLIT", $str);

Output:

u a a o

Demo
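
The steps above can also be wrapped in a single function. The following is only a sketch and a variant of the code above, not the same method: the helper name `stripCombiningEscapes` is made up for illustration, and the final `iconv()` call is swapped for a `preg_replace()` that removes combining marks (Unicode category `Mn`), on the assumption that only the base letters are wanted. It also assumes PCRE was built with Unicode property support.

```php
<?php
// Hypothetical helper (name invented for this sketch).
function stripCombiningEscapes(string $name): string
{
    // Step 1: turn literal \uXXXX escapes into hexadecimal HTML entities.
    $entities = preg_replace('/\\\\u([0-9a-fA-F]{4})/', '&#x$1;', $name);

    // Step 2: decode the entities into real UTF-8 characters.
    $utf8 = html_entity_decode($entities, ENT_QUOTES, 'UTF-8');

    // Step 3 (swapped in for iconv): drop combining marks such as U+0308,
    // which leaves the preceding base letter on its own.
    return preg_replace('/\p{Mn}+/u', '', $utf8);
}

// Single quotes keep the backslashes literal, as in the file names above.
echo stripCombiningEscapes('u\u0308 a\u0302 a\u0308a\u0308 o\u0300.zip'), "\n";
// u a aa o.zip
```

Because this variant does not go through a charset conversion, precomposed characters such as "ä" (U+00E4) are left unchanged; it only handles the decomposed form shown in the question.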

Amal Murali
  • Thanks a lot, haven't thought of entities! Not sure what part of the code is causing this, but if there is no space in between the unicode characters the code will output "?". So "a\u0308a\u0308" (ää) outputs "a?". I tried adding a {4} to the regex "'/\\\\u([0-9a-f]+){4}/i'", but it doesn't seem to fix the problem. – mrksbnch Mar 31 '14 at 12:38
  • @demrks: What should it output instead? `aa`? – Amal Murali Mar 31 '14 at 12:48
  • Yes, in this case 'aa'. If the string is "o\u0308u\u0308" (öü) something like "ou". – mrksbnch Mar 31 '14 at 12:55
  • Strange, "öü" actually works, but "ää" (-> "aa") doesn't, see https://eval.in/129478 – mrksbnch Mar 31 '14 at 13:00
  • Got it, this regex works for me: `preg_replace("/\\\\u([0-9a-fA-F]{4})/", "&#x\\1;", $string)` – mrksbnch Mar 31 '14 at 14:42
  • @demrks: Excellent. Thanks for taking the time to investigate the issue. I've now updated my answer to include the working solution. Cheers! – Amal Murali Mar 31 '14 at 14:48
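
A likely explanation for the "?" discussed in these comments, sketched under the assumption that the earlier pattern was `/\\\\u([0-9a-f]+)/i` (inferred from the thread, not quoted from the original answer): the greedy `+` keeps matching as long as it sees hex digits, and the letter "a" that follows `\u0308` is itself a hex digit. With "öü" the next letter is "u", which is not a hex digit, so the match stops in the right place.

```php
<?php
// Single quotes keep the backslashes literal.
$input = 'a\u0308a\u0308';

// Assumed earlier pattern: "+" also swallows the following "a".
preg_match('/\\\\u([0-9a-f]+)/i', $input, $greedy);

// Pattern from the comment above: "{4}" stops after exactly four hex digits.
preg_match('/\\\\u([0-9a-fA-F]{4})/', $input, $fixed);

echo $greedy[1], "\n"; // 0308a
echo $fixed[1], "\n";  // 0308
```

The five-digit capture turns into the entity `&#x0308a;`, a different code point from the intended U+0308, which would account for the unexpected output.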