13

I found this question, which shows how to check whether a string contains a Chinese character. I'm not sure if the Unicode ranges are correct, but they seem to return false for Japanese and Korean and true for Chinese.

What it doesn't do is tell if the character is traditional or simplified Chinese. How would you go about finding this out?


update

Q: How can I recognize from the 32 bit value of a Unicode character if this is a Chinese, Korean or Japanese character?

http://unicode.org/faq/han_cjk.html

Their argument is that the characters, regardless of their shape, have the same meaning and should therefore be represented by the same code. Well, the difference is not meaningless to me, because I am analyzing individual characters, which doesn't work with their solution:

A better solution is to look at the text as a whole: if there's a fair amount of kana, it's probably Japanese, and if there's a fair amount of hangul, it's probably Korean.
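
For reference, the whole-text heuristic from the FAQ could be sketched roughly like this in Python. The function name, the 10% threshold, and the choice of code point ranges (only the basic Hiragana/Katakana, Hangul, and CJK Unified Ideographs blocks) are my own assumptions for illustration, not something the FAQ specifies:

```python
def guess_cjk_language(text: str, threshold: float = 0.1) -> str:
    """Guess the language of CJK text by counting kana and hangul.

    Hypothetical helper: the ranges and threshold are illustrative only.
    """
    kana = hangul = han = 0
    for ch in text:
        cp = ord(ch)
        if 0x3040 <= cp <= 0x30FF:        # Hiragana and Katakana blocks
            kana += 1
        elif 0xAC00 <= cp <= 0xD7A3 or 0x1100 <= cp <= 0x11FF:
            hangul += 1                   # Hangul syllables and Jamo
        elif 0x4E00 <= cp <= 0x9FFF:      # basic CJK Unified Ideographs
            han += 1
    total = kana + hangul + han
    if total == 0:
        return "unknown"                  # no CJK characters at all
    if kana / total >= threshold:
        return "japanese"
    if hangul / total >= threshold:
        return "korean"
    # Mostly Han characters with little kana or hangul
    return "chinese"
```

This only separates Chinese from Japanese and Korean; it says nothing about traditional versus simplified.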

thenengah
  • 1
    would the codepage help distinguish? Seems like simplified Chinese is CP 936 and Traditional is CP 950, at least in the Microsoft world. Perhaps start at http://www.i18nguy.com/unicode/codepages.html for the MS and IBM codepages. – rajah9 Jan 06 '11 at 20:44
  • 4
    I did a quick google search and found this http://unicode.org/faq/han_cjk.html I found some of the questions interesting and they discuss Traditional characters in there too. Hope it helps! – Shaded Jan 06 '11 at 20:44
  • 2
    Shaded's linked FAQ seems to answer your question exactly. As the example in the link notes, how would you determine if "chat" is English or French? If you don't think that your answer is in there, you might want to expand your question a bit. – Thanatos Jan 06 '11 at 21:25
  • It's a good link, one that I had already found. Ah, quite complicated. The orthography of chat/chat (en/fr) surely makes them indistinguishable; however, if we used the IPA to write chat/chat [ʃæ/tʃæt], it would be possible through syllable construction, because it would be based on sound and not an archaic orthography. – thenengah Jan 07 '11 at 20:43
  • 1
    But Chinese is much less complicated, because 說/说 [t/s shuo1 'to speak'] are completely different characters, one being the traditional form and the other the simplified form. They have different Unicode values, as opposed to a/a (en/fr), which share the same character code. – thenengah Jan 07 '11 at 20:47
  • However, there are also lots of characters like 口 which are used in both simplified *and* traditional Chinese, and trying to decide "which one it is" is the same as trying to decide if "chat" is English or French. – lambshaanxy May 28 '12 at 06:42
  • a possible python answer: https://stackoverflow.com/a/57945099/191246 – ccpizza Dec 26 '22 at 12:45

3 Answers

6

As already stated, you can't reliably detect the script style from a single character, but it is possible for a sufficiently long sample of text. See https://github.com/jpatokal/script_detector for a Ruby gem that does the job, and Simplified Chinese Unicode table for a general discussion.
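
A minimal sketch of the whole-sample idea in Python (this is not how the gem is implemented; the two character sets below are tiny hand-picked samples and the function name is hypothetical, so a real detector would need far larger tables):

```python
# Hand-picked sample pairs, for illustration only (an assumption, not real data files).
SIMPLIFIED_ONLY = set("说韩这们国语")
TRADITIONAL_ONLY = set("說韓這們國語麵")


def guess_chinese_variant(text: str) -> str:
    """Vote over the whole text instead of judging a single character."""
    simplified_hits = sum(ch in SIMPLIFIED_ONLY for ch in text)
    traditional_hits = sum(ch in TRADITIONAL_ONLY for ch in text)
    if simplified_hits > traditional_hits:
        return "simplified"
    if traditional_hits > simplified_hits:
        return "traditional"
    # Only shared characters (or a tie): the sample cannot decide.
    return "undetermined"
```

A string made up only of shared characters such as 口 comes back as "undetermined", which is why a longer sample is needed.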

lambshaanxy
5

It is possible for some characters. The Traditional and Simplified character sets overlap, so you have basically three sets of characters:

  1. Characters that are traditional only.
  2. Characters that are simplified only.
  3. Characters that have been left untouched, and are available in both.

Take the character 面 for instance. It belongs both to #2 and #3... As a simplified character, it stands for 面 and 麵, face and noodles. Whereas 麵 is a traditional character only. So in the Unihan database, 麵 has a kSimplifiedVariant, which points to 面. So you can deduce that it is a traditional character only.

But 面 also has a kTraditionalVariant, which points to 麵. This is where the system breaks: if you use this data to deduce that 面 is a simplified character only, you'd be wrong...

On the other hand, 韩 has a kTraditionalVariant, pointing to 韓, and these two are a "real" Simplified/Traditional pair. But nothing in the Unihan database differentiates cases like 韓/韩 from cases like 麵/面.
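
To make the pitfall concrete, here is a rough Python sketch. The two dictionaries are hand-copied from the examples above rather than loaded from the actual Unihan files, and `naive_classify` is a hypothetical name:

```python
# Variant links as described above (hand-copied, not read from Unihan data).
K_SIMPLIFIED_VARIANT = {"麵": "面", "韓": "韩"}   # traditional -> simplified
K_TRADITIONAL_VARIANT = {"面": "麵", "韩": "韓"}  # simplified -> traditional


def naive_classify(ch: str) -> str:
    """Classify a character using only the presence of variant links."""
    has_simplified_link = ch in K_SIMPLIFIED_VARIANT
    has_traditional_link = ch in K_TRADITIONAL_VARIANT
    if has_simplified_link and not has_traditional_link:
        return "traditional-only"
    if has_traditional_link and not has_simplified_link:
        return "simplified-only"
    return "shared-or-unknown"


for ch in "麵韓韩面口":
    print(ch, naive_classify(ch))
```

The result for 韩 is right, but 面 also comes out as "simplified-only", even though it is a perfectly good traditional character (face). With only these links, the two cases look identical.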

dda
2

As I think you've discovered, you can't. Simplified and traditional are just two styles of writing the same characters - it's like the difference between Roman and Gothic script for European languages.

Tom Anderson
  • Yeah, I guess so. Function over form type thing. It's a catch-22. You already have to know if the character is S/T in order to check its value. So I'm just going to build dictionaries first and then check by those :) – thenengah Jan 06 '11 at 21:30
  • BTW - there actually is a way to check through the bytes, but the unicode site said it was impractical because there were a ton of exceptions. Go figure! :) – thenengah Jan 06 '11 at 21:32
  • This is false, and the analogy to Roman/Gothic is also false. As dda explains below, Simplified and Traditional are _overlapping_ character sets. Characters are either: 1) found only in Traditional, 2) found only in Simplified, or 3) found in both. Because each character has its own, unique Unicode code point, you can, at a minimum, auto-detect which of these three categories they belong in. Re: Roman/Gothic, you seem to mean typographic script (font) and not alphabet, but Trad/Simp are definitely not just different fonts for identical bytestreams. – Ryan Lue Feb 22 '21 at 20:49