Given a block of text that's known to be Chinese and encoded in UTF-8, is there a way to determine if it's Simplified or Traditional?
2 Answers
4
I don't know if this will work, but I'd try using iconv to see whether the text converts cleanly to each charset, comparing the result of the same conversion with //TRANSLIT against the result with //IGNORE. If the two results match, the conversion hasn't hit any character that fails to translate, so the text fits that charset.
$test1 = iconv("UTF-8", "big5//TRANSLIT", $text);
$test2 = iconv("UTF-8", "big5//IGNORE", $text);
if ($test1 == $test2) {
    echo 'traditional';
} else {
    $test3 = iconv("UTF-8", "gb2312//TRANSLIT", $text);
    $test4 = iconv("UTF-8", "gb2312//IGNORE", $text);
    if ($test3 == $test4) {
        echo 'simplified';
    } else {
        echo 'Failed to match either traditional or simplified';
    }
}

Mark Baker
- Interesting, thanks! It seems to definitely be working, although a lot of text is coming back as "neither" (example: "聲音 鳥 樹葉 話 説話 細 又 輕 蝴蝶 請 只有 和 得 聼得到 蜜蜂"). Any ideas? I also had to do `@iconv` for the 2 `TRANSLIT` calls to suppress errors. – philfreo Nov 03 '10 at 00:36
- You've got some z-variant characters in there that aren't in basic GB-2312, but they are in GB-18030. Try `'gb18030'` instead of `'gb2312'`. Or if your input is Windows-oriented you may prefer `'cp936'` (and `'cp950'` instead of `'big5'`). – bobince Nov 03 '10 at 21:36
- I swapped in `gb18030` and all of my test data was recognized. (Cannot be sure of the accuracy though). Thanks! – philfreo Nov 04 '10 at 18:17
- GB18030 is a Unicode Transformation Format, i.e. GB18030 will match every single character inside Unicode, traditional, archaic or simplified (and including, say, Korean and Arabic). The text quoted is clearly Traditional Chinese, but 聼 is not included in Big5 despite being a Traditional Chinese character. – Henry Jan 20 '16 at 19:02
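The last comment's point can be checked with a small sketch. It assumes the mbstring extension rather than iconv, and `fitsCharset` is a hypothetical helper name, not part of either answer. It round-trips the text through a charset and reports whether every character survived; because GB18030 covers all of Unicode, Traditional text passes there too, so `gb18030` cannot separate the two scripts.

```php
<?php
// Hypothetical helper (not from the answers): true when every character of
// $text survives a round trip through $charset. Uses mbstring, not iconv.
function fitsCharset(string $text, string $charset): bool
{
    // Characters the charset cannot represent are replaced by mbstring's
    // substitute character, so the round-tripped string no longer matches.
    $encoded = mb_convert_encoding($text, $charset, 'UTF-8');
    return mb_convert_encoding($encoded, 'UTF-8', $charset) === $text;
}

$traditional = '聲音 鳥 樹葉';
var_dump(fitsCharset($traditional, 'EUC-CN'));  // EUC-CN is mbstring's name for GB2312; expected false
var_dump(fitsCharset($traditional, 'GB18030')); // expected true: GB18030 covers all of Unicode
```

A `true` result for `GB18030` therefore only says the input is valid Unicode text, not that it is Simplified.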
2
Since `big5` and `gb2312` omit quite a few commonly used variants that are present in Unicode, code that relies on an exact match between the `TRANSLIT` and `IGNORE` modes would fail in quite a lot of normal use cases: it would fail to identify 説話 as Traditional Chinese, even though 説 is a common variant in Hong Kong of 說, which is the form included in `big5`.
A simple fix is to do it in a fuzzy way: convert with `//IGNORE` only, and see which charset keeps more of the characters:
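// //IGNORE drops the characters the target charset cannot represent
$test1 = iconv("UTF-8", "big5//IGNORE", $text);
$test2 = iconv("UTF-8", "gb2312//IGNORE", $text);
// Count characters in each result's own encoding, not the default one
$len1 = mb_strlen($test1, 'BIG-5');
$len2 = mb_strlen($test2, 'EUC-CN'); // mbstring's name for GB2312
$len0 = mb_strlen($text, 'UTF-8') * 0.8; // threshold
if ($len1 > $len2 && $len1 > $len0) {
    return 'Likely Traditional';
}
if ($len2 > $len1 && $len2 > $len0) {
    return 'Likely Simplified';
}
return 'Could not identify';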
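
For reuse, the same fuzzy idea can be sketched as a pair of functions. This is a variation under stated assumptions, not the code above: it uses mbstring (PHP 7.4+ for `mb_str_split`) and a per-character round trip instead of `iconv` with `//IGNORE`, whose behaviour can vary between iconv implementations and PHP versions. The names `charsetCoverage` and `guessChineseScript` and the default threshold parameter are hypothetical.

```php
<?php
// Hypothetical helper: fraction of the characters in $text that $charset can represent.
function charsetCoverage(string $text, string $charset): float
{
    $chars = mb_str_split($text, 1, 'UTF-8'); // requires PHP 7.4+
    if (count($chars) === 0) {
        return 0.0;
    }
    $kept = 0;
    foreach ($chars as $ch) {
        // A character counts as representable only if it survives a round trip.
        $encoded = mb_convert_encoding($ch, $charset, 'UTF-8');
        if (mb_convert_encoding($encoded, 'UTF-8', $charset) === $ch) {
            $kept++;
        }
    }
    return $kept / count($chars);
}

// Hypothetical wrapper: same decision rule as the fuzzy check, using coverage ratios.
function guessChineseScript(string $text, float $threshold = 0.8): string
{
    $big5 = charsetCoverage($text, 'BIG-5');
    $gb   = charsetCoverage($text, 'EUC-CN'); // mbstring's name for GB2312
    if ($big5 > $gb && $big5 > $threshold) {
        return 'Likely Traditional';
    }
    if ($gb > $big5 && $gb > $threshold) {
        return 'Likely Simplified';
    }
    return 'Could not identify';
}

echo guessChineseScript('這是繁體中文'), "\n";
echo guessChineseScript('这是简体中文'), "\n";
```

Text made up only of characters shared by both charsets, such as 中文, scores equally for both and falls through to 'Could not identify', which seems the honest answer for such input.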

Henry