Regexp for non-ASCII characters

Question

Consider this snippet using regular expressions in Python 3:

>>> import re
>>> t = "Meu cão é #paraplégico$."
>>> re.sub("[^A-Za-z0-9 ]", "", t, flags=re.UNICODE)
'Meu co  paraplgico'

Why does it delete non-ASCII characters? I tried without the flag and it's all the same.

As a bonus, can anyone make this work on Python 2.7 as well?

Because `a-z` is `abcdef...xyz` and this does not include `ã`. If you want all word characters, use `\w`. — Has QUIT--Anony-Mousse, Mar 05 '13 at 12:53

score 4 · Answer 1 · edited Mar 17 '13 at 13:44


You are replacing every character that is not an ASCII letter, an ASCII digit, or a space (`[^A-Za-z0-9 ]`) with the empty string (`""`). The non-ASCII characters are not in the ranges A-Z, a-z, or 0-9, so they are removed along with the punctuation.

You can match all word characters like this:

>>> import re
>>> t = "Meu cão é #paraplégico$."
>>> re.sub(r"[^\w ]", "", t, flags=re.UNICODE)
'Meu cão é paraplégico'

Or you could add the specific characters you need to the character class, like so: `[^A-Za-z0-9ãé ]`.
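To make the difference between the two character classes concrete, here is a minimal sketch using only the standard-library `re` module on Python 3. It assumes Python 3 `str` patterns, where `\w` is already Unicode-aware and `re.UNICODE` is therefore redundant, which would explain why the flag made no difference in the question. On Python 2.7 the same idea should need `u"..."` strings together with `flags=re.UNICODE`. The variable names are only illustrative.

```python
import re

text = "Meu cão é #paraplégico$."

# ASCII-only class: ã and é are outside A-Z / a-z, so they are removed too.
ascii_only = re.sub(r"[^A-Za-z0-9 ]", "", text)

# \w is Unicode-aware for str patterns in Python 3, so accented letters survive.
word_chars = re.sub(r"[^\w ]", "", text)

print(ascii_only)   # Meu co  paraplgico
print(word_chars)   # Meu cão é paraplégico

# Caveat: \w also matches digits and the underscore, so those are kept as well.
with_underscore = re.sub(r"[^\w ]", "", "cão_2!")
print(with_underscore)  # cão_2
```

Note that `[^\w ]` is not an exact Unicode counterpart of `[^A-Za-z0-9 ]`, because it additionally keeps the underscore.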

edited Mar 17 '13 at 13:44

dda


answered Mar 05 '13 at 12:12

Yeonho


Yep, I got it! But what is the equivalent of A-Za-z in Unicode? – fccoelho Mar 05 '13 at 12:17

In many (other) languages you could use Unicode properties to define a regex of `[^\p{Alpha} ]`. See http://stackoverflow.com/questions/1832893/python-regex-matching-unicode-properties for alternatives in Python. – Joe Mar 05 '13 at 12:40
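If installing a third-party module is not an option, one commonly used workaround with the standard-library `re` module is the class `[^\W\d_]`, which reads as "a word character that is neither a digit nor an underscore", that is, a letter. The sketch below assumes Python 3; the sample string and variable names are made up for illustration, and this is an approximation of `\p{Alpha}` rather than an exact equivalent.

```python
import re

text = "Meu cão_2 é #paraplégico$."

# Letters only: a word character (\w) that is not a digit and not an underscore.
words = re.findall(r"[^\W\d_]+", text)
print(words)  # ['Meu', 'cão', 'é', 'paraplégico']

# Remove everything except letters and spaces:
# first alternative drops non-word, non-space characters,
# second alternative drops the digits and underscores that \w would keep.
letters_and_spaces = re.sub(r"[^\w ]|[\d_]", "", text)
print(letters_and_spaces)  # Meu cão é paraplégico
```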

score 3 · Accepted Answer · answered Mar 05 '13 at 12:52


In [1]: import regex  # third-party module from PyPI, not the standard-library re
In [2]: t = u"Meu cão é #paraplégico$."
In [3]: print(regex.sub(r"[^\p{Alpha} ]", "", t, flags=regex.UNICODE))
Meu cão é paraplégico

answered Mar 05 '13 at 12:52

dda


score 0 · Answer 3 · answered Mar 05 '13 at 12:56


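I solved this by switching to the regex library (from PyPI).

The call then became (Python 2.7 syntax; the `ur` prefix is a syntax error in Python 3, where a plain `r"..."` string literal serves the same purpose):

regex.sub(ur"[^\p{L}\p{N} ]+", u"", t)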
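For comparison, roughly the same filter can be sketched without the regex module by checking each character's Unicode general category with the standard-library `unicodedata` module. This is an illustrative Python 3 sketch, not the approach used in the answers above, and the function name `keep_letters_numbers_spaces` is hypothetical.

```python
import unicodedata


def keep_letters_numbers_spaces(text):
    """Drop every character that is not a Unicode letter, a Unicode number, or a space."""
    kept = []
    for ch in text:
        # General categories starting with "L" are letters (Lu, Ll, ...),
        # and those starting with "N" are numbers (Nd, Nl, No).
        if ch == " " or unicodedata.category(ch)[0] in ("L", "N"):
            kept.append(ch)
    return "".join(kept)


print(keep_letters_numbers_spaces("Meu cão é #paraplégico$."))  # Meu cão é paraplégico
```

One caveat: if the input is in decomposed form (NFD), the accents are separate combining marks (category `Mn`) and would be stripped, so normalizing with `unicodedata.normalize("NFC", text)` first may be advisable.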

answered Mar 05 '13 at 12:56

fccoelho

