0

I need to convert a UTF-8 string to a non-UTF-8 (plain ASCII) string. I want to replace:

Ậ,Ẫ,Ẩ,Ầ,Ấ,Â,Ặ,Ẵ,Ẳ,Ằ,Ắ,Ă,Ạ,Ã,Ả,À,Á to A,

Ự,Ữ,Ử,Ừ,Ứ,Ư,Ụ,Ũ,Ủ,Ù,Ú to U,

Ợ,Ỡ,Ở,Ờ,Ớ,Ơ,Ộ,Ỗ,Ổ,Ồ,Ố,Ô,Ọ,Õ,Ỏ,Ò,Ó to O, and many more cases like these.

Can this be done with PHP's preg_replace()?

Can I use the following?

$string = preg_replace('/Ậ,Ẫ,Ẩ,Ầ,Ấ,Â,Ặ,Ẵ,Ẳ,Ằ,Ắ,Ă,Ạ,Ã,Ả,À,Á/', 'A', $string);
$string = preg_replace('/Ợ,Ỡ,Ở,Ờ,Ớ,Ơ,Ộ,Ỗ,Ổ,Ồ,Ố,Ô,Ọ,Õ,Ỏ,Ò,Ó/', 'O', $string);
$string = preg_replace('/Ự,Ữ,Ử,Ừ,Ứ,Ư,Ụ,Ũ,Ủ,Ù,Ú/', 'U', $string);
Sachin Prasad
Jio Cul
  • possible duplicate of [Replacing accented characters php](http://stackoverflow.com/questions/3371697/replacing-accented-characters-php) – PleaseStand Jan 15 '13 at 17:50
  • No, that post isn't useful to me. I just want to use `preg_replace()`, and not for all UTF-8 characters – Jio Cul Jan 15 '13 at 18:08

3 Answers

1

Since regular expressions aren't the best tool for this, consider PHP's iconv facilities instead:

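$string = 'ỬỪỨƯỤ';

// remember the current locale ('0' queries it without changing anything)
$locale = setlocale(LC_CTYPE, '0');
// temporarily switch to a UTF-8 locale so //TRANSLIT can map the characters
setlocale(LC_CTYPE, 'en_US.UTF-8');
// use iconv to transliterate
$string = iconv('utf-8', 'us-ascii//TRANSLIT', $string);
// restore the previous locale
setlocale(LC_CTYPE, $locale);

// $string should now be "UUUUU" (results depend on the system's iconv implementation)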
Linus Kleen
  • While this is a great idea, it doesn't always come with the expected result. For example, "aÎâa" gets translated as "a^I^aa". It probably depends on other factors as well, since "ỬỪỨƯỤ" for me gets translated as "". – rid Jan 15 '13 at 18:00
0

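You can, if you remove the commas, place everything inside a character class, and add the /u modifier so that the pattern and the subject are treated as UTF-8. As written, each of your patterns matches only the entire comma-separated sequence as literal text. Example: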

preg_replace('/[ỰỮỬỪỨƯỤŨỦÙÚ]/u', 'U', $string);

You can also use str_replace():

str_replace(array('Ự', 'Ữ', ...), 'U', $string);

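or strtr() with a replacement array (the three-argument string form works byte by byte, so it corrupts multibyte UTF-8 characters):

strtr($string, array('Ự' => 'U', 'Ữ' => 'U', 'Ử' => 'U'));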
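
The groups listed in the question can also be collected into one lookup table and applied in a single pass. The following is only a sketch: `build_map` is a hypothetical helper name, not a PHP built-in, and it assumes the input uses precomposed (NFC) characters like the ones in the question.

```php
<?php
// Hypothetical helper: turn "ASCII letter => accented letters" groups
// into a replacement array suitable for strtr().
function build_map(array $groups): array
{
    $map = [];
    foreach ($groups as $ascii => $accented) {
        // Split on code points rather than bytes; /u makes PCRE read the string as UTF-8.
        foreach (preg_split('//u', $accented, -1, PREG_SPLIT_NO_EMPTY) as $char) {
            $map[$char] = $ascii;
        }
    }
    return $map;
}

$map = build_map([
    'A' => 'ẬẪẨẦẤÂẶẴẲẰẮĂẠÃẢÀÁ',
    'U' => 'ỰỮỬỪỨƯỤŨỦÙÚ',
    'O' => 'ỢỠỞỜỚƠỘỖỔỒỐÔỌÕỎÒÓ',
]);

// Every key in $map is replaced by its value; other characters pass through unchanged.
echo strtr('ỬỪỨƯỤ', $map), "\n";
```

Lowercase letters and other base letters would need their own groups; the table only covers what is listed in it.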
rid
0

I've only done this in Java once, but in PHP the trick will be similar.

In Unicode, after you normalize to a decomposed form (NFD or NFKD), an accented letter is encoded as a base letter followed by one or more combining diacritical marks. Just drop the marks and you're done.

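import java.text.Normalizer;
import java.util.regex.Pattern;

// Combining marks, plus modifier letters (Lm) and modifier symbols (Sk)
private static final Pattern DIACRITIC =
        Pattern.compile("[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}]+");

// Decompose the text (NFKD), then delete everything the pattern matches
public static String replaceCombiningDiacriticalMarks(String text) {
    return DIACRITIC.matcher(Normalizer.normalize(text, Normalizer.Form.NFKD)).replaceAll("");
}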

If you also have characters from other alphabets, or mathematical symbols, things get trickier. It's still possible to replace them with pure ASCII (a √ with a v, for example), but the choice of replacement character becomes arbitrary.
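
A rough PHP counterpart of the normalize-then-strip idea might look like the sketch below. It assumes the intl extension is installed (it provides the `Normalizer` class); `strip_diacritics` is a hypothetical function name, and the pattern is narrowed to `\p{Mn}` (nonspacing marks) instead of the broader character class used in the Java version.

```php
<?php
// Requires the intl extension for the Normalizer class.
function strip_diacritics(string $text): string
{
    // Canonical decomposition: "Ấ" becomes "A" followed by two combining marks.
    $decomposed = Normalizer::normalize($text, Normalizer::FORM_D);
    if ($decomposed === false) {
        // normalize() returns false on failure (e.g. invalid UTF-8); leave the text as-is.
        return $text;
    }
    // \p{Mn} matches nonspacing combining marks; /u enables UTF-8 mode.
    return preg_replace('/\p{Mn}+/u', '', $decomposed);
}

echo strip_diacritics('ẤỬỢ'), "\n"; // prints "AUO"
```

Letters that have no decomposition, such as Đ, pass through unchanged and would still need an explicit replacement.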

iwein