6

Given a block of text that's known to be Chinese and encoded in UTF-8, is there a way to determine if it's Simplified or Traditional?

Makoto
philfreo

2 Answers

4

I don't know if this will work, but I'd try using iconv to see if it will translate between the charsets correctly, comparing the results from the same conversion with //TRANSLIT and //IGNORE. If the two results match, then the charset conversion hasn't encountered any characters that fail to translate, so you should have a match.

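// iconv raises a notice when it hits a character it cannot convert, so suppress it.
$test1 = @iconv("UTF-8", "big5//TRANSLIT", $text);
$test2 = @iconv("UTF-8", "big5//IGNORE", $text);
// Require a successful conversion: two failed calls (false == false) must not count as a match.
if ($test1 !== false && $test1 === $test2) {
   echo 'traditional';
} else {
   $test3 = @iconv("UTF-8", "gb2312//TRANSLIT", $text);
   $test4 = @iconv("UTF-8", "gb2312//IGNORE", $text);
   if ($test3 !== false && $test3 === $test4) {
      echo 'simplified';
   } else {
      echo 'Failed to match either traditional or simplified';
   }
}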
Mark Baker
  • Interesting, thanks! It seems to definitely be working, although a lot of text is coming back as "neither" (example: "聲音 鳥 樹葉 話 説話 細 又 輕 蝴蝶 請 只有 和 得 聼得到 蜜蜂"). Any ideas? I also had to do `@iconv` for the 2 `TRANSLIT` calls to suppress errors. – philfreo Nov 03 '10 at 00:36
  • 4
    You've got some z-variant characters in there that aren't in basic GB-2312, but they are in GB-18030. Try `'gb18030'` instead of `'gb2312'`. Or if your input is Windows-oriented you may prefer `'cp936'` (and `'cp950'` instead of `'big5'`). – bobince Nov 03 '10 at 21:36
  • I swapped in `gb18030` and all of my test data was recognized. (Cannot be sure of the accuracy though). Thanks! – philfreo Nov 04 '10 at 18:17
  • GB18030 is a Unicode Transformation Format, i.e. GB18030 will match every single character inside Unicode, traditional, archaic or simplified (and including, say Korean and Arabic). The text quoted is clearly Traditional Chinese but 聼 is not included inside Big5 despite it being a Traditional Chinese character. – Henry Jan 20 '16 at 19:02
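To illustrate the point in the last comment, here is a small sketch showing that a successful GB18030 conversion says nothing about the script. It assumes the system's iconv supports GB18030, and `survivesGb18030` is a hypothetical helper name used only for this example.

```php
<?php
// Hypothetical helper: true when $text converts to GB18030 and back unchanged.
function survivesGb18030(string $text): bool
{
    $encoded = @iconv('UTF-8', 'GB18030', $text);
    if ($encoded === false) {
        return false; // conversion failed outright
    }
    return @iconv('GB18030', 'UTF-8', $encoded) === $text;
}

// Traditional, Simplified, Korean and Arabic samples are all checked the same way.
foreach (['聲音', '声音', '한국어', 'العربية'] as $sample) {
    echo $sample, ': ', survivesGb18030($sample) ? 'converts' : 'fails', "\n";
}
```

If every sample converts, a "match" against `gb18030` cannot by itself be read as "simplified".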
2

Since Big5 and GB2312 omit quite a few commonly used variants that are present in Unicode, code that relies on an exact match between the TRANSLIT and IGNORE results fails in many normal use cases. For example, it fails to identify 説話 as Traditional Chinese, even though 説 is a common variant in Hong Kong, where Big5 is used.

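A simple fix is to compare fuzzily: count how many characters survive each conversion, and pick the charset that keeps more of them, provided it keeps more than 80% of the input: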

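// Drop the characters that each legacy charset cannot represent.
$test1 = iconv("UTF-8", "big5//IGNORE", $text);
$test2 = iconv("UTF-8", "gb2312//IGNORE", $text);
// Count characters, not bytes: tell mb_strlen which encoding each string is in.
$len1 = mb_strlen($test1, "BIG-5");
$len2 = mb_strlen($test2, "EUC-CN"); // iconv's gb2312 output is EUC-CN encoded
$len0 = mb_strlen($text, "UTF-8") * 0.8; // threshold: 80% of the input must survive
if ($len1 > $len2 && $len1 > $len0) {
    return 'Likely Traditional';
}
if ($len2 > $len1 && $len2 > $len0) {
    return 'Likely Simplified';
}
return 'Could not identify';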
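
How iconv handles `//IGNORE` depends on the system's iconv implementation, so the same fuzzy comparison can also be sketched with mbstring. This is a variant under assumptions, not the snippet above: `countRepresentable` and `guessChineseScript` are hypothetical names, and the result depends on the Big5 and GB2312 tables that ship with mbstring (named `BIG-5` and `EUC-CN` there).

```php
<?php
// Sketch only: both function names are hypothetical helpers, not library APIs.

// Count the characters of $text that $charset can represent.
function countRepresentable(string $text, string $charset): int
{
    $previous = mb_substitute_character();  // remember the current setting
    mb_substitute_character('none');        // drop unmappable characters
    $encoded = (string) mb_convert_encoding($text, $charset, 'UTF-8');
    mb_substitute_character($previous);     // restore the setting
    return mb_strlen($encoded, $charset);
}

// Returns 'traditional', 'simplified' or 'unknown'.
function guessChineseScript(string $text, float $threshold = 0.8): string
{
    $total = mb_strlen($text, 'UTF-8');
    if ($total === 0) {
        return 'unknown';
    }
    $big5 = countRepresentable($text, 'BIG-5');
    $gb   = countRepresentable($text, 'EUC-CN');
    $min  = $total * $threshold;            // minimum share that must survive
    if ($big5 > $gb && $big5 > $min) {
        return 'traditional';
    }
    if ($gb > $big5 && $gb > $min) {
        return 'simplified';
    }
    return 'unknown';
}

echo guessChineseScript('這是繁體中文'), "\n";
echo guessChineseScript('这是简体中文'), "\n";
```

Text made up only of characters shared by both scripts gives equal counts and comes back as `unknown`, which is the honest answer for such input.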
Henry