
I have a Unicode string in Python. I am looking for a way to determine whether the string contains any Chinese/Japanese characters. If possible, it would be better to also be able to locate those characters.

It seems this is a bit different from a language-detection problem: my string can be a mixture of English and Chinese text.

My code has Internet access.

Dr. Alpha
  • possible answers: http://stackoverflow.com/questions/6432926/how-can-i-relate-unicode-blocks-to-languages-scripts http://stackoverflow.com/questions/4545977/python-can-i-detect-unicode-string-language-code?rq=1 – Patashu Apr 16 '13 at 01:52

3 Answers


You can use the Unicode Script property to determine which script each character is commonly associated with.

Python's unicodedata module, sadly, does not expose this property. However, a number of third-party modules, such as unicodedata2 and unicodescript, do have this information. You can query them and check whether you have any characters in the Han script, which corresponds to Chinese (as well as Kanji and Hanja).
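
For example (a sketch of my own, not from this answer): the third-party regex module on PyPI also exposes the Script property, so Han characters can be matched directly with \p{Han}:

import regex  # third-party module: pip install regex

# \p{Han} matches any character whose Unicode Script property is Han
han_pattern = regex.compile(r'\p{Han}')

def find_han(text):
    # (position, character) pairs, which also locates the characters
    return [(m.start(), m.group()) for m in han_pattern.finditer(text)]

print(find_han('Hello 世界'))  # [(6, '世'), (7, '界')]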

nneonneo
  • Thanks! Are any of the third-party modules packaged for Ubuntu or other distros? I didn't see packages for unicodedata2 or unicodescript. Are there any Python bug reports about this gap? – nealmcb Sep 05 '14 at 00:03

You can use the regex `[\u2E80-\u9FFF]` to match CJK characters.
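
A minimal sketch (my own) of using that pattern with the built-in re module; finditer also reports the positions, which locates the matched characters:

import re

# matches any single code point in the range U+2E80 through U+9FFF
cjk_pattern = re.compile(u'[\u2E80-\u9FFF]')

s = u'Hello, 你好!'
for m in cjk_pattern.finditer(s):
    print(m.start(), m.group())
# 7 你
# 8 好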

比尔盖子
    This is almost, but not completely correct. According to [Scripts.txt](http://www.unicode.org/Public/UNIDATA/Scripts.txt), the official Unicode database, the Han characters cover a **subset** of `2E80` to `9FCC`, along with `F900` to `FAD9` and `20000` to `2FA1D`. But the subset is somewhat complex... – nneonneo Apr 16 '13 at 02:07

I tried Python's unicodedata module, which nneonneo mentioned in his answer, and it seems to work well.

>>> import unicodedata
>>> unicodedata.name('你')
'CJK UNIFIED IDEOGRAPH-4F60'
>>> unicodedata.name('桜')
'CJK UNIFIED IDEOGRAPH-685C'
>>> unicodedata.name('あ')
'HIRAGANA LETTER A'
>>> unicodedata.name('ア')
'KATAKANA LETTER A'
>>> unicodedata.name('a')
'LATIN SMALL LETTER A'

As you can see, both Chinese characters and the Chinese characters adopted into Japanese are categorized as CJK UNIFIED IDEOGRAPH, and hiragana and katakana are recognized correctly. I didn't test Korean characters, but I'd guess they fall under CJK UNIFIED IDEOGRAPH too.
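
Wrapped up as a helper function (my own sketch, not part of the test above), the name-based check might look like this; unicodedata.name() raises ValueError for unnamed characters, so a default of '' is passed:

import unicodedata

def find_cjk_chars(text):
    # return (index, char) for characters whose Unicode name marks them
    # as CJK ideographs or Japanese kana
    prefixes = ('CJK UNIFIED IDEOGRAPH', 'HIRAGANA', 'KATAKANA')
    return [(i, ch) for i, ch in enumerate(text)
            if unicodedata.name(ch, '').startswith(prefixes)]

print(find_cjk_chars('abc你好あア'))  # [(3, '你'), (4, '好'), (5, 'あ'), (6, 'ア')]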

Also, if you only care about whether or not something is a CJK character/letter, this seems simpler:

>>> import unicodedata
>>> unicodedata.category('你')
'Lo'
>>> unicodedata.category('桜')
'Lo'
>>> unicodedata.category('あ')
'Lo'
>>> unicodedata.category('ア')
'Lo'
>>> unicodedata.category('a')
'Ll'
>>> unicodedata.category('A')
'Lu'

According to the Unicode general category values, `Ll` is "Letter, lowercase", `Lu` is "Letter, uppercase", and `Lo` is "Letter, other".
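
As a quick sketch of that check (my own; note the caveat in the comment below that many non-CJK scripts are also `Lo`, so this over-matches):

import unicodedata

def has_lo_letter(text):
    # True if any character has general category 'Lo' (Letter, other);
    # Arabic, Hebrew, Indic scripts etc. are also 'Lo', so this is only a rough filter
    return any(unicodedata.category(ch) == 'Lo' for ch in text)

print(has_lo_letter('hello'))   # False
print(has_lo_letter('hello你')) # True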

Dr. Alpha
  • Korean Hangul characters are generally identified as such. The "unified" part collects glyphs which are (generally) shared among these scripts, but the Hangul script is exclusively Korean. Better anyway to look at the Script property than at the Block name or the Category (there are many `Lo` characters which are mathematical symbols, graphics decorations, etc., or just not in one of the scripts you are looking for; Arabic, Hebrew, Indic scripts, etc. all lack the upper/lowercase distinction). – tripleee Apr 16 '13 at 03:24
  • @tripleee Is there an easy way to do what you suggested in Python? – Dr. Alpha Apr 16 '13 at 03:28