Python 2.7: test if characters in a string are all Chinese characters

Question

The following code tests if characters in a string are all Chinese characters. It works for Python 3 but not for Python 2.7. How do I do it in Python 2.7?

for ch in name:
    if ord(ch) < 0x4e00 or ord(ch) > 0x9fff:
        return False

Is `name` a unicode string or a byte string? You don't have to use `ord` here, btw: `if ch < u'\u4e00' or ch > u'\u9fff':` works too. — Martijn Pieters, May 08 '13 at 13:11
Related: http://stackoverflow.com/questions/16027450/is-there-a-way-to-know-whether-a-unicode-string-contains-any-chinese-japanese-ch/16028174#16028174 — Daenyth, May 08 '13 at 13:14

root · Accepted Answer · 2013-05-08T14:21:58.043

#  byte str (what you probably get from GAE)
In [1]: s = """Chinese (汉语/漢語 Hànyǔ or 中文 Zhōngwén) is a group of related
        language varieties, several of which are not mutually intelligible,"""

#  unicode str
In [2]: us = u"""Chinese (汉语/漢語 Hànyǔ or 中文 Zhōngwén) is a group of related
        language varieties, several of which are not mutually intelligible,"""

#  convert to unicode using str.decode('utf-8')    
In [3]: print ''.join(c for c in s.decode('utf-8') 
                   if u'\u4e00' <= c <= u'\u9fff')
汉语漢語中文

In [4]: print ''.join(c for c in us if u'\u4e00' <= c <= u'\u9fff')
汉语漢語中文

To make sure all the characters are Chinese, something like this should do:

all(u'\u4e00' <= c <= u'\u9fff' for c in name.decode('utf-8'))
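
The one-liner above can be wrapped in a small helper. This is only a sketch: the name `is_all_chinese` is hypothetical, and it assumes that any byte string passed in is UTF-8 encoded. It also rejects empty input, because `all()` returns True for an empty sequence. Note that U+4E00 to U+9FFF covers only the main CJK Unified Ideographs block, so characters from the extension blocks (for example U+3400 to U+4DBF) would be rejected.

```python
def is_all_chinese(name):
    """Return True if name is non-empty and every character is in U+4E00..U+9FFF."""
    # Hypothetical helper: accept either a UTF-8 byte string or decoded text.
    # In Python 2.7, `bytes` is an alias for `str`, so this check covers both versions.
    if isinstance(name, bytes):
        name = name.decode('utf-8')
    # all() is True for an empty iterable, so reject empty input explicitly.
    return bool(name) and all(u'\u4e00' <= c <= u'\u9fff' for c in name)


print(is_all_chinese(u'\u4e2d\u6587'))   # two ideographs
print(is_all_chinese(u'\u4e2d\u6587 A'))  # contains a space and a Latin letter
```

Whether empty input should count as valid is a design choice; drop the `bool(name)` check if an empty name is acceptable.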

In your Python application, use unicode internally: decode early and encode late, creating a "unicode sandwich".
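
As a rough illustration of the sandwich, the sketch below decodes at the input boundary, works on unicode in the middle, and encodes only when producing output. The function name `keep_chinese` is made up for this example, and UTF-8 is assumed on both sides.

```python
def keep_chinese(raw_bytes):
    """Hypothetical example: filter a UTF-8 byte string down to its Chinese characters."""
    # Decode early: bytes from the outside world become unicode at the boundary.
    text = raw_bytes.decode('utf-8')
    # Work on unicode only: compare characters, never raw bytes.
    kept = u''.join(c for c in text if u'\u4e00' <= c <= u'\u9fff')
    # Encode late: convert back to bytes only when writing output.
    return kept.encode('utf-8')
```

In a real application the middle layer would usually span many functions, all of which take and return unicode.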

Only one comment - rather than decoding into a nonce value, it might be better to store the decoded unicode object, and work internally with unicode. — Marcin, May 08 '13 at 13:49
@Marcin -- You are absolutely right, will add a note, thanks. — root, May 08 '13 at 13:50

Martijn Pieters · Answer 2 · 2013-05-08T14:48:00.777

This works fine for me in Python 2.7, provided name is a unicode() value:

>>> ord(u'\u4e00') < 0x4e00
False
>>> ord(u'\u4dff') < 0x4e00
True

You do not have to use ord here if you compare the character directly with unicode values:

>>> u'\u4e00' < u'\u4e00'
False
>>> u'\u4dff' < u'\u4e00'
True
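
To see why the same comparison fails on undecoded data, consider the sketch below. In Python 2.7, iterating over a byte string yields one byte at a time, so `ord` never sees a value above 255. The example uses `bytearray` to expose the byte values, which behaves the same way in Python 2.7 and Python 3; the variable names are only illustrative.

```python
# One Chinese character (U+4E2D) encoded as UTF-8 occupies three bytes.
encoded = u'\u4e2d'.encode('utf-8')

# The integer value of each byte: this is what a per-byte ord() check would see.
byte_values = list(bytearray(encoded))

# After decoding, the string is a single character with the expected code point.
code_point = ord(encoded.decode('utf-8'))

print(byte_values)      # three values, all far below 0x4e00
print(hex(code_point))  # the actual code point of the character
```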

Data from an incoming request will not yet have been decoded to unicode; you'll need to do that first. Explicitly set the accept-charset attribute on your form tag to ensure that the browser uses the correct encoding:

<form accept-charset="utf-8" action="...">

then decode the data on the server side:

name = self.request.get('name').decode('utf8')
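
Since the browser is not guaranteed to send valid UTF-8, the decode step can fail. A defensive variant might look like the sketch below; `decode_form_value` is a hypothetical helper, not part of any framework, and returning None for bad input is just one possible policy. It also passes through values that are already unicode, in case the framework has decoded them for you.

```python
def decode_form_value(raw):
    """Hypothetical helper: return the value as unicode, or None if it is not valid UTF-8."""
    if not isinstance(raw, bytes):
        return raw  # already decoded text; nothing to do
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        # Malformed input: let the caller decide how to report it.
        return None
```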

edited May 08 '13 at 14:48

answered May 08 '13 at 13:14

I am working on Google App Engine with Python. The `name` is obtained by `name = self.request.get('name')` from a form, and the user needs to enter Chinese characters only. Do I need to convert `name` into unicode? And how? – Randy Tang May 08 '13 at 13:26
@Tang: Yes, you'd have to convert the data to Unicode first. Browsers usually use the encoding of the HTML page, so if you serve your pages with `Content-Type: text/html; charset=utf-8` then you can assume you can decode as UTF-8 as well. – Martijn Pieters May 08 '13 at 14:42
