
Using Python 2.7.

I am trying to write a regex that can recognize any Unicode digit 0-9 (not just Arabic numerals, but Simplified Chinese numerals as well) and any Unicode word character.

For example I have:

4_1424336,P-九 

(九 is the Chinese numeral 9.)

And I want to return:

9_9999999,A-9

My current function is:

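import re

def multiple_replace(myString):
    # Letters: any Unicode word character that is neither a digit nor an underscore.
    myString = re.sub(ur'(?u)[^\W\d_]', u'A', myString)
    # Digits: anything \d matches once the pattern is Unicode-aware.
    myString = re.sub(ur'(?u)\d', u'9', myString)
    return myString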

EDIT:

I also tried the following, with the same result:

def multiple_replace(myString):
    myLetters_regex = re.compile(r'[^\W\d_]', re.UNICODE)
    myNumbers_regex = re.compile(r'[\d]', re.UNICODE)
    myString = myNumbers_regex.sub('9', myString)
    myString = myLetters_regex.sub('A', myString)
    return myString

and I get...

9_9999999,A-A (i.e. 九 is flagged as an 'A' instead of a '9')

So, my questions are:

1) Is there any other way to write the character class so that the alphanumerics matched via \W do NOT include the numerics?

2) Is there something I am missing about recognizing Chinese numerals using python regex?

user14696
  • For #2, try setting the `re.UNICODE` flag when defining your regular expression. Still digging on #1 - your character class that excludes `\W_*\d` might be the best way to go, once the `\w` and `\d` classes are unicode-aware. Although `*` is not generally considered a word character so I don't think you need to explicitly forbid it. – Peter DeGlopper Nov 30 '13 at 22:36
  • AFAIK, the answer to #1 is no. The character class is fixed. You could however, define your own for convenience. – JDong Nov 30 '13 at 22:44
  • See this post: [find all chinese characters in python](http://stackoverflow.com/questions/2718196/find-all-chinese-text-in-a-string-using-python-and-regex) – gongzhitaao Nov 30 '13 at 22:44
  • What's the codepoint for that character, anyway? – Peter DeGlopper Nov 30 '13 at 22:49
  • @ Peter...see OP (edited). The re.UNICODE flag didn't make a difference. I know errors have been found in the python re module before...could this possibly be one? – user14696 Nov 30 '13 at 22:50
  • I doubt it - I would like to test this more closely, and knowing the code point in question would help. – Peter DeGlopper Nov 30 '13 at 22:51
  • @ Peter, I've been using this to get the hex... http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=%E4%B9%9D&mode=char – user14696 Nov 30 '13 at 22:52
  • If all else fails, this library claims to support all Unicode properties: https://pypi.python.org/pypi/regex – Peter DeGlopper Nov 30 '13 at 22:59
  • Digging pretty deep - the Unicode properties database defines the category for that codepoint to be `Lo`, or 'Letter, Other'. The Unihan database knows its numerical value (kPrimaryNumeric) but it looks like that isn't enough to make `\d` count it as a digit. I'm not sure anymore whether there is a way to match it with a character class. – Peter DeGlopper Nov 30 '13 at 23:14
  • @Peter: I am testing a function using the PyPI library, but what I would really like to know is how you found out that the Unicode properties database has 九 as 'Letter, Other'. I entered `import unicodedata` and then `unicodedata.category(u'九')`, and I get "Unsupported characters in input". – user14696 Dec 02 '13 at 00:13
  • @Peter: are you using Python 3? (The above error only occurs in 2.7.) – user14696 Dec 02 '13 at 02:09
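
The category lookup discussed in the comments above can be reproduced with the standard `unicodedata` module. This is only a minimal sketch: it writes 九 as the escape `u'\u4e5d'` on the assumption that the "Unsupported characters in input" message comes from the console rejecting the literal character rather than from `unicodedata` itself. The `u''` prefixes keep it runnable on both Python 2.7 and Python 3.3+.

```python
# -*- coding: utf-8 -*-
import re
import unicodedata

ch = u'\u4e5d'  # 九, written as an escape so the console encoding does not matter

# General category from the Unicode database ('Nd' would mean decimal digit).
category = unicodedata.category(ch)

# Numeric value if the database defines one, otherwise the default (None).
value = unicodedata.numeric(ch, None)

# \d only matches decimal digits (category Nd), even with re.UNICODE set.
is_digit = re.match(r'\d', ch, re.UNICODE) is not None

print(category, value, is_digit)
```

If the category comes back as a letter category while `unicodedata.numeric` still returns a value, that would explain why `\d` skips the character even though its numeric meaning is known.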

1 Answer


Check Ponyguruma, a Python binding to the Oniguruma regular expression engine.


For numbers:

re.sub(ur'\p{N}', '9', myString)

For letters:

re.sub(ur'\p{L}', 'A', myString)
Ωmega
  • Um...Can you NOT install this on a Windows machine? I've tried (and even did a round in Cygwin -- ended up with a bunch of errors and 'error: command 'gcc' failed with exit status 1'). Looking over the GitHub repo didn't lead to much insight on this... – user14696 Dec 02 '13 at 01:49
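
If installing a third-party engine is not an option, one possible workaround is to stay with the standard `re` module and decide per character with `unicodedata.numeric`, which can report a numeric value even for characters that `\d` does not match. The following is a rough sketch, not a tested drop-in: the helper names `_WORD_CHAR` and `_mask` are made up for illustration, and treating only the integer values 0 through 9 as "digits" is an assumption based on the question's wording.

```python
# -*- coding: utf-8 -*-
import re
import unicodedata

# Any Unicode word character except the underscore (letters and numerics).
_WORD_CHAR = re.compile(r'[^\W_]', re.UNICODE)


def _mask(match):
    """Return u'9' for characters whose numeric value is 0-9, else u'A'."""
    ch = match.group(0)
    # numeric() returns the default (None) when no numeric value is defined.
    value = unicodedata.numeric(ch, None)
    if value is not None and value == int(value) and 0 <= value <= 9:
        return u'9'
    return u'A'


def multiple_replace(myString):
    # A single pass: every word character is classified by its numeric value,
    # so the order of a separate "digits" and "letters" substitution no longer matters.
    return _WORD_CHAR.sub(_mask, myString)


print(multiple_replace(u'4_1424336,P-\u4e5d'))
```

Note that under this assumption a character such as 十 (ten) would be masked as a letter; widening the range check would change that.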