How to find out Chinese or Japanese Character in a String in Python?

Question

Such as:

str = 'sdf344asfasf天地方益3権sdfsdf'

Add () to Chinese and Japanese Characters:

strAfterConvert = 'sdfasfasf(天地方益)3(権)sdfsdf'

Since this is rather broad and I don't want to look up the ranges: you'd decode from UTF-8 to get `unicode` objects, then use a regex to detect specific *ranges* of Unicode codepoints. What those ranges are for Chinese and Japanese is an exercise in research into the Unicode standard. — Martijn Pieters, May 06 '15 at 07:28
Related: http://stackoverflow.com/questions/1366068/whats-the-complete-range-for-chinese-characters-in-unicode — EdChum, May 06 '15 at 07:47
From the link I posted above you could iterate over the characters and test the value of `ord` against the various CJK ranges for Chinese chracters — EdChum, May 06 '15 at 07:51
@EdChum Sorry, this is wildly off-topic - but you should change your profile text (I checked your prof for the lulz) from " peoples' " to " people's ". Otherwise, you're saying that hell is the code of other ethnic groups/nations (a people). — EvenLisle, May 06 '15 at 19:05
@EvenLisle Hmm, maybe I really do think that also though.... — EdChum, May 06 '15 at 19:11

score 28 · Answer 1 · edited Dec 22 '18 at 22:11

28

As a start, you can check if the character is in one of the following unicode blocks:

Unicode Block 'CJK Unified Ideographs' - U+4E00 to U+9FFF
Unicode Block 'CJK Unified Ideographs Extension A' - U+3400 to U+4DBF
Unicode Block 'CJK Unified Ideographs Extension B' - U+20000 to U+2A6DF
Unicode Block 'CJK Unified Ideographs Extension C' - U+2A700 to U+2B73F
Unicode Block 'CJK Unified Ideographs Extension D' - U+2B740 to U+2B81F

After that, all you need to do is iterate through the string, checking if the char is Chinese, Japanese or Korean (CJK) and append accordingly:

# -*- coding:utf-8 -*-
ranges = [
  {"from": ord(u"\u3300"), "to": ord(u"\u33ff")},         # compatibility ideographs
  {"from": ord(u"\ufe30"), "to": ord(u"\ufe4f")},         # compatibility ideographs
  {"from": ord(u"\uf900"), "to": ord(u"\ufaff")},         # compatibility ideographs
  {"from": ord(u"\U0002F800"), "to": ord(u"\U0002fa1f")}, # compatibility ideographs
  {'from': ord(u'\u3040'), 'to': ord(u'\u309f')},         # Japanese Hiragana
  {"from": ord(u"\u30a0"), "to": ord(u"\u30ff")},         # Japanese Katakana
  {"from": ord(u"\u2e80"), "to": ord(u"\u2eff")},         # cjk radicals supplement
  {"from": ord(u"\u4e00"), "to": ord(u"\u9fff")},
  {"from": ord(u"\u3400"), "to": ord(u"\u4dbf")},
  {"from": ord(u"\U00020000"), "to": ord(u"\U0002a6df")},
  {"from": ord(u"\U0002a700"), "to": ord(u"\U0002b73f")},
  {"from": ord(u"\U0002b740"), "to": ord(u"\U0002b81f")},
  {"from": ord(u"\U0002b820"), "to": ord(u"\U0002ceaf")}  # included as of Unicode 8.0
]

def is_cjk(char):
  return any([range["from"] <= ord(char) <= range["to"] for range in ranges])

def cjk_substrings(string):
  i = 0
  while i<len(string):
    if is_cjk(string[i]):
      start = i
      while is_cjk(string[i]): i += 1
      yield string[start:i]
    i += 1

string = "sdf344asfasf天地方益3権sdfsdf".decode("utf-8")
for sub in cjk_substrings(string):
  string = string.replace(sub, "(" + sub + ")")
print string

The above prints

sdf344asfasf(天地方益)3(権)sdfsdf

To be future-proof, you might want to keep a lookout for CJK Unified Ideographs Extension E. It will ship with Unicode 8.0, which is scheduled for release in June 2015. I've added it to the ranges, but you shouldn't include it until Unicode 8.0 is released.

[EDIT]

Added CJK compatibility ideographs, Japanese Kana and CJK radicals.

edited Dec 22 '18 at 22:11

python dude

7,980
11
40
53

answered May 06 '15 at 07:51

EvenLisle

4,672
3
24
47

1

That doesn't cover all the various ranges: http://stackoverflow.com/questions/1366068/whats-the-complete-range-for-chinese-characters-in-unicode – EdChum May 06 '15 at 07:52
1

@EdChum I've updated my answer to include the available unicode ranges. – EvenLisle May 06 '15 at 08:39
Those ranges will miss Japanese Kana characters and a bunch of CJK symbols, strokes, radicals, compatibility characters, and phonetic extensions. It would be easier, and more reliable, to check the Unicode "Script" property. – 一二三 May 06 '15 at 08:59
@一二三 Updated answer to include CJK compatibility ideographs and Japanese Kana. – EvenLisle May 06 '15 at 09:26
@一二三 Thanks for your suggestions, added CJK radicals supplement. Anything still missing? – EvenLisle May 06 '15 at 10:40
I think this is a fairly complete list: http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Script=Han:][:Script=Bopo:][:Script=Hira:][:Script=Katakana:] – 一二三 May 06 '15 at 11:39
I can't see anything there that's not in my answer (although I haven't combed through it thoroughly). I haven't been able to find a solid ready-to-use library for this, so if you've got some specific unicode subset in mind, I'd appreciate if you mentioned it. – EvenLisle May 06 '15 at 12:24
You're missing the very first character in there. As I mentioned in my first comment, you should be checking the "Script" property of each character—not checking blocks. The list I posted shows the script subset I think would be appropriate. – 一二三 May 07 '15 at 01:10
1

I got a 'TypeError: ord() expected a character, but string of length 2 found' for {"from": ord(u"\U0002a700"), "to": ord(u"\U0002b73f")} and all the other rows containing "\U". Can you please take a look? Thanks. – J Freebird Aug 19 '15 at 00:46
@JFreebird I got this too. Different compilations of python have different character ranges. Instead of using `ord` to build the ranges, use hex strings like this: http://stackoverflow.com/a/9169489/306503 – Robert Dodd Apr 25 '17 at 20:51
1

The range for hiragana is missing. Please add {'from': ord(u'\u3040'), 'to': ord(u'\u309f')}. – lacton Jul 17 '17 at 12:36
nice work, but using builtin names like `range` for local variables is a bad practice – z33k May 06 '19 at 11:24
also you don't need to build a list for evaluation by `any()` (generator expression will suffice) – z33k May 06 '19 at 12:26

一二三 · Answer 2 · 2015-05-07T12:25:04.793

20

You can do the edit using the regex package, which supports checking the Unicode "Script" property of each character and is a drop-in replacement for the re package:

import regex as re

pattern = re.compile(r'([\p{IsHan}\p{IsBopo}\p{IsHira}\p{IsKatakana}]+)', re.UNICODE)

input = u'sdf344asfasf天地方益3権sdfsdf'
output = pattern.sub(r'(\1)', input)
print output  # Prints: sdf344asfasf(天地方益)3(権)sdfsdf

You should adjust the \p{Is...} sequences with the character scripts/blocks that you consider to be "Chinese or Japanese".

edited May 07 '15 at 12:25

answered May 07 '15 at 12:18

一二三

21,059
11
65
74

Can `regex` tell which type it belongs to? – natsuapo May 19 '17 at 09:11

alvas · Answer 3 · 2016-05-18T22:42:51.047

From one of the bleeding edge branch of NLTK inspired by the Moses Machine Translation Toolkit:

def is_cjk(character):
    """"
    Checks whether character is CJK.

        >>> is_cjk(u'\u33fe')
        True
        >>> is_cjk(u'\uFE5F')
        False

    :param character: The character that needs to be checked.
    :type character: char
    :return: bool
    """
    return any([start <= ord(character) <= end for start, end in 
                [(4352, 4607), (11904, 42191), (43072, 43135), (44032, 55215), 
                 (63744, 64255), (65072, 65103), (65381, 65500), 
                 (131072, 196607)]
                ])

For the specifics of the ord() numbers:

class CJKChars(object):
    """
    An object that enumerates the code points of the CJK characters as listed on
    http://en.wikipedia.org/wiki/Basic_Multilingual_Plane#Basic_Multilingual_Plane

    This is a Python port of the CJK code point enumerations of Moses tokenizer:
    https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/detokenizer.perl#L309
    """
    # Hangul Jamo (1100–11FF)
    Hangul_Jamo = (4352, 4607) # (ord(u"\u1100"), ord(u"\u11ff"))

    # CJK Radicals Supplement (2E80–2EFF)
    # Kangxi Radicals (2F00–2FDF)
    # Ideographic Description Characters (2FF0–2FFF)
    # CJK Symbols and Punctuation (3000–303F)
    # Hiragana (3040–309F)
    # Katakana (30A0–30FF)
    # Bopomofo (3100–312F)
    # Hangul Compatibility Jamo (3130–318F)
    # Kanbun (3190–319F)
    # Bopomofo Extended (31A0–31BF)
    # CJK Strokes (31C0–31EF)
    # Katakana Phonetic Extensions (31F0–31FF)
    # Enclosed CJK Letters and Months (3200–32FF)
    # CJK Compatibility (3300–33FF)
    # CJK Unified Ideographs Extension A (3400–4DBF)
    # Yijing Hexagram Symbols (4DC0–4DFF)
    # CJK Unified Ideographs (4E00–9FFF)
    # Yi Syllables (A000–A48F)
    # Yi Radicals (A490–A4CF)
    CJK_Radicals = (11904, 42191) # (ord(u"\u2e80"), ord(u"\ua4cf"))

    # Phags-pa (A840–A87F)
    Phags_Pa = (43072, 43135) # (ord(u"\ua840"), ord(u"\ua87f"))

    # Hangul Syllables (AC00–D7AF)
    Hangul_Syllables = (44032, 55215) # (ord(u"\uAC00"), ord(u"\uD7AF"))

    # CJK Compatibility Ideographs (F900–FAFF)
    CJK_Compatibility_Ideographs = (63744, 64255) # (ord(u"\uF900"), ord(u"\uFAFF"))

    # CJK Compatibility Forms (FE30–FE4F)
    CJK_Compatibility_Forms = (65072, 65103) # (ord(u"\uFE30"), ord(u"\uFE4F"))

    # Range U+FF65–FFDC encodes halfwidth forms, of Katakana and Hangul characters
    Katakana_Hangul_Halfwidth = (65381, 65500) # (ord(u"\uFF65"), ord(u"\uFFDC"))

    # Supplementary Ideographic Plane 20000–2FFFF
    Supplementary_Ideographic_Plane = (131072, 196607) # (ord(u"\U00020000"), ord(u"\U0002FFFF"))

    ranges = [Hangul_Jamo, CJK_Radicals, Phags_Pa, Hangul_Syllables, 
              CJK_Compatibility_Ideographs, CJK_Compatibility_Forms, 
              Katakana_Hangul_Halfwidth, Supplementary_Ideographic_Plane]

Combining the is_cjk() in this answer and @EvenLisle substring answer

>>> from nltk.tokenize.util import is_cjk
>>> text = u'sdf344asfasf天地方益3権sdfsdf'
>>> [1 if is_cjk(ch) else 0 for ch in text]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
>>> def cjk_substrings(string):
...     i = 0
...     while i<len(string):
...         if is_cjk(string[i]):
...             start = i
...             while is_cjk(string[i]): i += 1
...             yield string[start:i]
...         i += 1
... 
>>> string = "sdf344asfasf天地方益3権sdfsdf".decode("utf-8")
>>> for sub in cjk_substrings(string):
...     string = string.replace(sub, "(" + sub + ")")
... 
>>> string
u'sdf344asfasf(\u5929\u5730\u65b9\u76ca)3(\u6a29)sdfsdf'
>>> print string
sdf344asfasf(天地方益)3(権)sdfsdf

Thx! It works, where @EvenLisle's answer failed: `ﾃﾝﾎﾟﾗﾘ`. Minor quibble: `type character: char` in the docstring - it's `str` actually (there's no `char` type in Python) — z33k, May 07 '19 at 09:52

score 5 · Answer 4 · edited May 23 '17 at 10:31

If you can't use regex module that provides access to IsKatakana, IsHan properties as shown in @一二三's answer; you could use character ranges from @EvenLisle's answer with stdlib's re module:

>>> import re
>>> print(re.sub(u"([\u3300-\u33ff\ufe30-\ufe4f\uf900-\ufaff\U0002f800-\U0002fa1f\u30a0-\u30ff\u2e80-\u2eff\u4e00-\u9fff\u3400-\u4dbf\U00020000-\U0002a6df\U0002a700-\U0002b73f\U0002b740-\U0002b81f\U0002b820-\U0002ceaf]+)", r"(\1)", u'sdf344asfasf天地方益3権sdfsdf'))
sdf344asfasf(天地方益)3(権)sdfsdf

Beware of known issues.

You could also check Unicode category:

>>> import unicodedata
>>> unicodedata.category(u'天')
'Lo'
>>> unicodedata.category(u's')
'Ll'

How to find out Chinese or Japanese Character in a String in Python?

4 Answers4

Linked