Writing python regex that recognizes all unicode letters

Question

There is no [\p{Ll}\p{Lo}] in Python, and I'm struggling to write a regular expression that recognizes Unicode letters... and that doesn't confuse punctuation such as '-' or add funny diacritics when the script encounters a phonetic mark (like 'ô' or 'طس').

My goal is to label ALL letters (ASCII and any Unicode) as an "A", and any number [1-9] as a "9".

My current function is:

import re

def multiple_replace(myString):
    myString = re.sub(r'(?u)[^\W\d_]|-','A', myString)
    myString = re.sub(r'[0-9]', '9', myString)
    return myString

The results I am getting are (notice the inconsistency in how '-' is being labeled... sometimes as an 'A', sometimes as an 'Aœ'):

TX 35-L | AA 99AA
М-21 | AAœA99
A 1 طس | A 9 A~˜A·A~AA
US-50 | AAA99
yeni sinop-erfelek yolu çevre yolu | AAAA AAAAAAAAAAAAA AAAA AƒA§AAAA AAAA
Av Antônio Ribeiro | AA AAAAƒA´AAA AAAAAAA

What I need to get is this:

TX 35-L | AA 99-A
М-21 | A-99
A 1 طس | A 9 AAAAA
US-50 | AA-99
yeni sinop-erfelek yolu çevre yolu | AAAA AAAAAAAAAAAAA AAAA AAAAAAAA AAAA
Av Antônio Ribeiro | AA AAAAAAAAAA AAAAAAA

...is it even possible (with the re module in Python 2.7) to identify ALL UTF-8 characters that are NOT common punctuation marks (i.e. '()', ',', '.', '-', etc.) and NOT 1-9 numbers without [\p{Ll}\p{Lo}]?

The meaning of most of the Python character classes in regular expressions is controlled by the `LOCALE` and `UNICODE` flags on the regexp. I haven't tested your exact situation, but with `re.UNICODE` set `\w` and `\W` use the Unicode character database to determine what counts as alphanumeric. — Peter DeGlopper, Nov 22 '13 at 00:53
@1st1: 2.7. Tried this. No dice. :( `def multiple_replace(myString): myUnicodeLetters_regex = re.compile(r'(?u)[^\W\d_]|-', re.UNICODE) myNumbers_regex = re.compile(r'[0-9]', re.UNICODE) myString = myUnicodeLetters_regex.sub('A', myString) myString = myNumbers_regex.sub('9', myString) return myString` — user14696, Nov 22 '13 at 04:33
Can you try python 3.3? It should have better unicode support in `re` module too. — 1st1, Nov 22 '13 at 04:35
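
For anyone following the Python 3 suggestion, here is a minimal sketch of how the flags discussed in these comments change what `\w` matches. It assumes Python 3, where `str` patterns use Unicode matching by default and `re.ASCII` turns it off; the sample string and variable names are made up for illustration.

```python
import re

# Illustrative sample: Latin with a diacritic, one Cyrillic letter, digits, underscore.
sample = "Ant\u00f4nio \u041c-21 x_9"  # "Antônio М-21 x_9"

# Python 3 default for str patterns: \w consults the Unicode character database.
unicode_word_chars = re.findall(r"\w", sample)

# With re.ASCII, \w only matches [a-zA-Z0-9_].
ascii_word_chars = re.findall(r"\w", sample, flags=re.ASCII)

# [^\W\d_] reads as "a word character that is not a digit and not an underscore",
# which leaves the letters.
letters = re.findall(r"[^\W\d_]", sample)

print(unicode_word_chars)
print(ascii_word_chars)
print(letters)
```

In Python 2.7 the default is the other way around: matching is ASCII-only unless `re.UNICODE` (or `(?u)`) is set, and the pattern and input should both be Unicode strings.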

score 2 · Answer 1 · answered Nov 22 '13 at 07:30

If using Python 2.7, use Unicode strings. I'm assuming your "What I need" examples are incorrect, or do you really want AAAAA for طس? If reading the strings from a file, decode the strings to Unicode first.

#!python2
#coding: utf8
import re

# Note leading u
data = u'TX 35-L|М-21|A 1 طس|US-50|yeni sinop-erfelek yolu çevre yolu|Av Antônio Ribeiro'.split('|')

for d in data:
    r = re.sub(ur'(?u)[^\W\d_]',u'A', d)
    r = re.sub(ur'[0-9]', u'9', r)
    print d
    print r
    print

Output:

TX 35-L
AA 99-A

М-21
A-99

A 1 طس
A 9 AA

US-50
AA-99

yeni sinop-erfelek yolu çevre yolu
AAAA AAAAA-AAAAAAA AAAA AAAAA AAAA

Av Antônio Ribeiro
AA AAAAAAA AAAAAAA
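
For comparison, a rough Python 3 port of the same idea, offered as a sketch and not as part of the original answer. It assumes Python 3, where every `str` is already Unicode, so neither the `u` prefix nor `(?u)` is needed; the function name `label_characters` is hypothetical.

```python
import re

def label_characters(text):
    """Replace every letter with 'A' and every ASCII digit with '9'."""
    # Letters: word characters that are neither digits nor underscores.
    text = re.sub(r"[^\W\d_]", "A", text)
    # [0-9] is deliberately ASCII-only; \d would also match other Unicode decimal digits.
    return re.sub(r"[0-9]", "9", text)

for s in ["TX 35-L", "US-50", "Av Ant\u00f4nio Ribeiro"]:
    print(s, "|", label_characters(s))
```

Punctuation such as '-' is left untouched because the `|-` alternative from the question's pattern has been dropped.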

answered Nov 22 '13 at 07:30

Mark Tolonen

Can you maybe explain the difference between using 'ur' and (?u)... I thought (?u) WAS a flag looking for that preceding u'...'. – user14696 Nov 22 '13 at 18:11
[(?u)](http://docs.python.org/2.7/library/re.html?highlight=re.sub#re.UNICODE) means make `\w` and `\d` for example use the Unicode character properties database for deciding if a character is alphanumeric or a digit; otherwise, only the ASCII definition is used. Preceding a string with `u` makes it a Unicode string vs. a byte string, and `r` makes it a raw string (`r'\n'` is two characters instead of a single newline). Also check out this presentation: http://nedbatchelder.com/text/unipain.html – Mark Tolonen Nov 22 '13 at 19:06
okay....one question though....for "A 1 طس" I actually get "A 9 AAAAAAAA"....is that just not what you get? – user14696 Nov 23 '13 at 00:45
No, that is two Arabic letters (TAH and SEEN). To get eight As the string must be UTF32-encoded. That is why you should process your strings in Unicode. – Mark Tolonen Nov 23 '13 at 04:14
Another recommended article on Unicode: http://www.joelonsoftware.com/articles/Unicode.html – Mark Tolonen Nov 23 '13 at 04:19
Thanks M. Tolonen. Dead-on! – user14696 Nov 25 '13 at 03:24
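
To make the counts in the comments above concrete, here is a small sketch (assuming Python 3; the variable names are illustrative) comparing the number of code points in the two Arabic letters with the number of bytes in two encodings. A regex run over a Unicode string sees code points, so it finds two letters.

```python
import re

text = "\u0637\u0633"  # the two Arabic letters TAH and SEEN from the question

code_points = len(text)                       # characters as the regex sees them
utf8_bytes = len(text.encode("utf-8"))        # two bytes per letter here
utf32_bytes = len(text.encode("utf-32-le"))   # four bytes per letter, no BOM

letters_found = len(re.findall(r"[^\W\d_]", text))

print(code_points, utf8_bytes, utf32_bytes, letters_found)
```

Any count higher than two for this string means the regex was applied to encoded bytes (or to wrongly decoded text) and not to the decoded Unicode string.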

score -2 · Accepted Answer · answered Nov 25 '13 at 03:24

Not sure why my answer just got deleted, but here is what I went forth with:

function (regex):

import re

def multiple_replace(myString):
    myString = re.sub(ur'(?u)[^\W\d_]', u'A', myString)
    myString = re.sub(ur'[0-9]', u'9', myString)
    return myString

call (w/ decoding):

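import codecs

# `reader` is assumed to be a csv.reader over a UTF-8 encoded input file
with codecs.open(r'test5.txt', 'w', 'utf-8') as outfile1:
    for row in reader:
        # Python 2's csv module yields byte strings, so decode every cell first
        unicode_row = [x.decode('utf-8') for x in row]
        item = unicode_row[csv_col_index]
        # write the decoded cell (not row[1]) so bytes and Unicode are never mixed
        outfile1.write(unicode_row[1] + u"," + item + u"," + multiple_replace(item) + u"\n")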
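
For reference, a sketch of the same decode-at-the-boundary pattern in Python 3, where `open()` handles the decoding and the `csv` module works on `str` directly. This is an illustration under assumptions, not the code used above: `label_characters`, `label_csv_column`, and their parameters are hypothetical names, and the output is simplified to two columns.

```python
import csv
import re

def label_characters(text):
    """Replace every letter with 'A' and every ASCII digit with '9'."""
    text = re.sub(r"[^\W\d_]", "A", text)
    return re.sub(r"[0-9]", "9", text)

def label_csv_column(in_path, out_path, col_index):
    """Write each value of one column next to its labeled form."""
    # encoding="utf-8" decodes on read and encodes on write, so the regex
    # only ever sees Unicode text; newline="" is what the csv docs recommend.
    with open(in_path, encoding="utf-8", newline="") as infile, \
         open(out_path, "w", encoding="utf-8", newline="") as outfile:
        writer = csv.writer(outfile)
        for row in csv.reader(infile):
            item = row[col_index]
            writer.writerow([item, label_characters(item)])
```

A call such as `label_csv_column("input.csv", "test5.txt", 1)` would process the second column; the input file name here is a placeholder.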
