Python regex to recognize Chinese numerals

Question

Using Python 2.7.

I am trying to write a regex that can recognize any Unicode digit 0-9 (not just Arabic numerals, but Simplified Chinese numerals as well) and any Unicode word character.

For example I have:

4_1424336,P-九

(九 is the Chinese numeral for 9.)

And I want to return:

9_9999999,A-9

My current function is:

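import re

def multiple_replace(myString):
    # Replace word characters that are not a digit, underscore or '*' with 'A'.
    myString = re.sub(ur'(?u)[^\W_*\d]', u'A', myString)
    # Replace every character that \d recognizes as a digit with '9'.
    myString = re.sub(ur'(?u)[\d]', u'9', myString)
    return myString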

Edit: I also tried the following, with the same result:

def multiple_replace(myString):
    myLetters_regex = re.compile(r'[^\W\d_]', re.UNICODE)
    myNumbers_regex = re.compile(r'[\d]', re.UNICODE)
    myString = myNumbers_regex.sub('9', myString)
    myString = myLetters_regex.sub('A', myString)
    return myString

and I get...

9_9999999,A-A (i.e. 九 is flagged as an 'A' instead of a '9')

So, my questions are:

1) Is there any other way to write the \W to NOT include the numerics in the alphanumerics?

2) Is there something I am missing about recognizing Chinese numerals using python regex?

For #2, try setting the `re.UNICODE` flag when defining your regular expression. Still digging on #1 - your character class that excludes `\W_*\d` might be the best way to go, once the `\w` and `\d` classes are unicode-aware. Although `*` is not generally considered a word character so I don't think you need to explicitly forbid it. — Peter DeGlopper, Nov 30 '13 at 22:36
AFAIK, the answer to #1 is no. The character class is fixed. You could however, define your own for convenience. — JDong, Nov 30 '13 at 22:44
see this post: [find all chinese characters in python](http://stackoverflow.com/questions/2718196/find-all-chinese-text-in-a-string-using-python-and-regex) — gongzhitaao, Nov 30 '13 at 22:44
@ Peter...see OP (edited). The re.UNICODE flag didn't make a difference. I know errors have been found in the python re module before...could this possibly be one? — user14696, Nov 30 '13 at 22:50
I doubt it - I would like to test this more closely, and knowing the code point in question would help. — Peter DeGlopper, Nov 30 '13 at 22:51
@ Peter, I've been using this to get the hex... http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=%E4%B9%9D&mode=char — user14696, Nov 30 '13 at 22:52
If all else fails, this library claims to support all Unicode properties: https://pypi.python.org/pypi/regex — Peter DeGlopper, Nov 30 '13 at 22:59
Digging pretty deep - the Unicode properties database defines the category for that codepoint to be `Lo`, or 'Letter, Other'. The Unihan database knows its numerical value (kPrimaryNumeric) but it looks like that isn't enough to make `\d` count it as a digit. I'm not sure anymore whether there is a way to match it with a character class. — Peter DeGlopper, Nov 30 '13 at 23:14
@Peter...I am testing a function using the PyPI package...but what I would really like to know is how you found out that the Unicode properties database has 九 as a 'Letter, Other'. I entered `import unicodedata` and then `unicodedata.category(u'九')`, and I get "Unsupported characters in input".... — user14696, Dec 02 '13 at 00:13
@Peter - are you using Python 3? (The above error only occurs in 2.7.) — user14696, Dec 02 '13 at 02:09
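
The category lookup discussed in the comments above can be reproduced with the standard library. This is a minimal sketch, not a definitive diagnosis. It assumes the code point in question is U+4E5D and writes it as an escape instead of a literal character, which may sidestep the "Unsupported characters in input" error (that message looks like a console encoding problem, not a `unicodedata` one). The variable names are illustrative only.

```python
import re
import unicodedata

ch = u'\u4e5d'  # U+4E5D, the CJK ideograph for nine

# General category from the Unicode database ('Lo' means Letter, Other).
category = unicodedata.category(ch)

# Numeric value, if the database records one; None is the fallback default.
numeric = unicodedata.numeric(ch, None)

# How the regex classes see the character when they are Unicode-aware.
is_digit = re.match(r'\d', ch, re.UNICODE) is not None
is_word = re.match(r'\w', ch, re.UNICODE) is not None

print('category=%s numeric=%r is_digit=%s is_word=%s'
      % (category, numeric, is_digit, is_word))
```

If the category comes back as `Lo`, that would explain the behaviour in the question: `\d` only covers decimal digits, so the character falls through to the letter pattern even though the database knows its numeric value.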

score 0 · Answer 1 · answered Dec 01 '13 at 01:38

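Check out Ponyguruma, a Python binding to the Oniguruma regular expression engine. The patterns below use Unicode property syntax (`\p{...}`), which Python 2.7's standard `re` module does not support, so they need an engine that does.

For numbers:

re.sub(ur'\p{N}', u'9', myString)

For letters:

re.sub(ur'\p{L}', u'A', myString)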

Ωmega


Um...Can you NOT install this on a Windows machine? I've tried (and even did a round in Cygwin -- ended up with a bunch of errors and 'error: command 'gcc' failed with exit status 1'). Looking over the GitHub repo didn't lead to much insight on this... – user14696 Dec 02 '13 at 01:49
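
If installing a third-party engine is not an option, a standard-library workaround may be enough. The sketch below is one possible approach, not the method proposed in the answer: it matches every word character except the underscore and lets `unicodedata.numeric()` decide whether each one counts as a number. The helper names `_WORD_CHAR` and `_mask` are hypothetical, and `multiple_replace` is reused from the question only for continuity.

```python
import re
import unicodedata

# Hypothetical helper pattern: any Unicode word character except underscore.
_WORD_CHAR = re.compile(r'[^\W_]', re.UNICODE)


def _mask(match):
    """Return u'9' for characters with a numeric value, u'A' otherwise."""
    ch = match.group(0)
    # unicodedata.numeric() returns the default (None) when the Unicode
    # database records no numeric value for the character.
    if unicodedata.numeric(ch, None) is not None:
        return u'9'
    return u'A'


def multiple_replace(my_string):
    """Mask letters as 'A' and numeric characters as '9' in a single pass."""
    return _WORD_CHAR.sub(_mask, my_string)


# u'\u4e5d' is the CJK ideograph for nine, written as an escape.
print(multiple_replace(u'4_1424336,P-\u4e5d'))
```

One caveat with this design: anything the database assigns a numeric value to becomes a '9', which would presumably include characters such as 十 (ten) or 百 (hundred), not only the digits 0-9. If that is too broad, the check could be narrowed to an explicit set of accepted characters.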
