Recognizing text as Simplified vs. Traditional Chinese

Question

Given a block of text that's known to be Chinese and encoded in UTF-8, is there a way to determine if it's Simplified or Traditional?

score 4 · Accepted Answer · answered Nov 03 '10 at 00:07

I don't know if this will work, but I'd try using iconv to see whether the text converts cleanly to each charset, comparing the results of the same conversion with //TRANSLIT and with //IGNORE. If the two results match, the conversion hasn't encountered any characters that fail to translate, so the text fits that charset.

// iconv() returns false (and raises a notice) if a conversion fails outright,
// so check for false and compare strictly.
$test1 = iconv("UTF-8", "big5//TRANSLIT", $text);
$test2 = iconv("UTF-8", "big5//IGNORE", $text);
if ($test1 !== false && $test1 === $test2) {
   echo 'traditional';
} else {
   $test3 = iconv("UTF-8", "gb2312//TRANSLIT", $text);
   $test4 = iconv("UTF-8", "gb2312//IGNORE", $text);
   if ($test3 !== false && $test3 === $test4) {
      echo 'simplified';
   } else {
      echo 'Failed to match either traditional or simplified';
   }
}

answered Nov 03 '10 at 00:07

Mark Baker

Interesting, thanks! It seems to definitely be working, although a lot of text is coming back as "neither" (example: "聲音鳥樹葉話説話細又輕蝴蝶請只有和得聼得到蜜蜂"). Any ideas? I also had to do `@iconv` for the 2 `TRANSLIT` calls to suppress errors. – philfreo Nov 03 '10 at 00:36
You've got some z-variant characters in there that aren't in basic GB-2312, but they are in GB-18030. Try `'gb18030'` instead of `'gb2312'`. Or if your input is Windows-oriented you may prefer `'cp936'` (and `'cp950'` instead of `'big5'`). – bobince Nov 03 '10 at 21:36
I swapped in `gb18030` and all of my test data was recognized. (Cannot be sure of the accuracy though). Thanks! – philfreo Nov 04 '10 at 18:17
GB18030 is a Unicode Transformation Format, i.e. it can encode every single character in Unicode: Traditional, archaic or Simplified (and also, say, Korean and Arabic), so a successful conversion to it tells you nothing about the script. The text quoted is clearly Traditional Chinese, but 聼 is not in Big5 despite being a Traditional Chinese character. – Henry Jan 20 '16 at 19:02
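
To make the point in the comments above concrete, here is a minimal sketch that checks whether a string converts to a charset at all. The helper name `fitsCharset` is hypothetical, and the sketch assumes an iconv build that supports GB2312 and GB18030 as target charsets and returns false when a character cannot be converted.

```php
<?php
// Hypothetical helper: true if every character of $text exists in $charset.
// With no //IGNORE or //TRANSLIT suffix, iconv() returns false on the first
// unconvertible character; @ silences the accompanying notice.
function fitsCharset(string $text, string $charset): bool
{
    return @iconv('UTF-8', $charset, $text) !== false;
}

// Traditional, Simplified and Korean samples.
foreach (['聲音鳥樹', '声音鸟树', '한국어'] as $sample) {
    printf(
        "%s  gb2312: %s  gb18030: %s\n",
        $sample,
        var_export(fitsCharset($sample, 'GB2312'), true),
        var_export(fitsCharset($sample, 'GB18030'), true)
    );
}
```

On such a build, all three samples should convert to GB18030, including the Korean one, while only the Simplified sample converts to GB2312. That is why switching to `gb18030` makes every test string "match" without actually identifying it as Simplified.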

score 2 · Answer 2 · answered Jan 20 '16 at 19:12

Since big5 and gb2312 omit quite a few commonly used variants that are present in Unicode, code that relies on an exact match between the //TRANSLIT and //IGNORE results fails in many normal use cases. For example, it fails to identify 説話 as Traditional Chinese, even though 説 is a common Hong Kong variant of 說, which is the form included in big5.

A simple fix is to make the check fuzzy: count how many characters survive each conversion and require most of the text to survive:

$test1 = iconv("UTF-8", "big5//IGNORE", $text);
$test2 = iconv("UTF-8", "gb2312//IGNORE", $text);
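// Count characters rather than bytes: tell mb_strlen() each string's encoding
$len1 = mb_strlen($test1, "BIG-5");
$len2 = mb_strlen($test2, "EUC-CN"); // mbstring's name for GB2312
$len0 = mb_strlen($text, "UTF-8") * 0.8; // threshold: 80% of the input must survive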
if ($len1 > $len2 && $len1 > $len0) {
    return 'Likely Traditional';
}
if ($len2 > $len1 && $len2 > $len0) {
    return 'Likely Simplified';
}
return 'Could not identify';
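
The snippet above depends on how the local iconv build handles //IGNORE, which varies between implementations. As a hedged variation on the same idea (not the answer's own code), the sketch below tests each character separately with a plain conversion and counts the ones that succeed. The function names `countConvertible` and `guessChineseScript` are hypothetical, and the sketch assumes iconv supports BIG5 and GB2312 as target charsets and returns false for an unconvertible character.

```php
<?php
// Hypothetical helper, not from the answer: counts how many of the given
// characters convert to $charset without error.
function countConvertible(array $chars, string $charset): int
{
    $count = 0;
    foreach ($chars as $ch) {
        // With no //IGNORE or //TRANSLIT suffix, iconv() returns false for a
        // character the target charset lacks; @ silences the notice.
        if (@iconv('UTF-8', $charset, $ch) !== false) {
            $count++;
        }
    }
    return $count;
}

// Hypothetical wrapper applying the same 80% threshold as the snippet above.
function guessChineseScript(string $text, float $threshold = 0.8): string
{
    // Split the UTF-8 input into single characters (empty array on bad UTF-8).
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY) ?: [];
    $min   = count($chars) * $threshold;
    $big5  = countConvertible($chars, 'BIG5');
    $gb    = countConvertible($chars, 'GB2312');

    if ($big5 > $gb && $big5 > $min) {
        return 'Likely Traditional';
    }
    if ($gb > $big5 && $gb > $min) {
        return 'Likely Simplified';
    }
    return 'Could not identify';
}

// Sample text from the comments on the accepted answer.
echo guessChineseScript('聲音鳥樹葉話説話細又輕蝴蝶請只有和得聼得到蜜蜂'), "\n";
```

Calling iconv once per character is slow for long documents, so this is only suitable for short strings or a sample of the text. Characters that exist in both charsets, such as ASCII and punctuation, count toward both totals, so text made up only of such characters comes back as 'Could not identify'.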
