Python's re module has no Unicode property classes such as [\p{Ll}\p{Lo}], and I'm struggling to write a regular expression that recognizes Unicode letters without treating punctuation such as '-' as a letter, and without producing garbled characters when the script encounters an accented or non-Latin letter (like 'ô' or 'طس').
My goal is to replace every letter (ASCII or any other Unicode letter) with an "A" and every digit [0-9] with a "9". Punctuation such as '-' should be left unchanged.
My current function is:
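import re

def multiple_replace(myString):
    # Replace letters (and, because of the '|-' alternative, hyphens) with 'A'
    myString = re.sub(r'(?u)[^\W\d_]|-', 'A', myString)
    # Replace ASCII digits with '9'
    myString = re.sub(r'[0-9]', '9', myString)
    return myString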
The results I am getting are below (notice the inconsistency in how '-' is labeled: sometimes as 'A', sometimes as 'Aœ'):
TX 35-L | AA 99AA
М-21 | AAœA99
A 1 طس | A 9 A~˜A·A~AA
US-50 | AAA99
yeni sinop-erfelek yolu çevre yolu | AAAA AAAAAAAAAAAAA AAAA AƒA§AAAA AAAA
Av Antônio Ribeiro | AA AAAAƒA´AAA AAAAAAA
What I need to get is this:
TX 35-L | AA 99-A
М-21 | A-99
A 1 طس | A 9 AA
US-50 | AA-99
yeni sinop-erfelek yolu çevre yolu | AAAA AAAAA-AAAAAAA AAAA AAAAA AAAA
Av Antônio Ribeiro | AA AAAAAAA AAAAAAA
...is it even possible (with the re module in Python 2.7) to match ALL Unicode letters, i.e. characters that are NOT common punctuation marks (such as '(', ')', ',', '.', '-') and NOT digits, without [\p{Ll}\p{Lo}]?
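
For reference, here is a minimal sketch of one possible approach. It rests on two assumptions that are not confirmed above: that the input arrives as UTF-8 encoded byte strings (which would explain the garbled output, since each byte of a multi-byte character is then matched on its own), and that only ASCII digits need to become '9'. The names label_chars, LETTER and DIGIT are made up for illustration. The idea is to decode to unicode first, drop the '|-' alternative so hyphens are left alone, and rely on the [^\W\d_] class with the UNICODE flag to stand in for "any letter".

```python
# -*- coding: utf-8 -*-
import re
import unicodedata

# "Word character that is neither a digit nor an underscore" is a common
# stand-in for "letter" when \p{L} is unavailable in the re module.
LETTER = re.compile(r'[^\W\d_]', re.UNICODE)
DIGIT = re.compile(r'[0-9]')  # assumption: only ASCII digits are relabeled

def label_chars(text, encoding='utf-8'):
    """Replace every letter with 'A' and every ASCII digit with '9'."""
    # Assumption: byte strings are UTF-8. Decoding first means the pattern
    # sees whole characters instead of individual bytes.
    if isinstance(text, bytes):
        text = text.decode(encoding)
    # Compose base letter + combining accent into one code point where
    # possible, so an accented letter is labeled as a single 'A'.
    text = unicodedata.normalize('NFC', text)
    text = LETTER.sub(u'A', text)
    return DIGIT.sub(u'9', text)

print(label_chars(u'TX 35-L'))             # AA 99-A
print(label_chars(u'US-50'))               # AA-99
print(label_chars(u'Av Antônio Ribeiro'))  # AA AAAAAAA AAAAAAA
```

Each letter maps to exactly one 'A', so the output always has the same length as the (NFC-normalized) input. Non-ASCII digits and any punctuation are passed through untouched under these assumptions.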