2

Consider this snippet using regular expressions in Python 3:

>>> import re
>>> t = "Meu cão é #paraplégico$."
>>> re.sub("[^A-Za-z0-9 ]","",t,flags=re.UNICODE)
'Meu co  paraplgico'

Why does it delete non-ASCII characters? I tried without the flag and it's all the same.

As a bonus, can anyone make this work on Python 2.7 as well?

dda
fccoelho

3 Answers

4

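You are replacing every character that is not an ASCII letter, a digit, or a space ([^A-Za-z0-9 ]) with the empty string (""). The ranges A-Z and a-z cover only ASCII letters, so ã and é fall outside the class and are removed along with the punctuation. The re.UNICODE flag does not change what explicit ranges match, which is why removing it makes no difference.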
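To see exactly which characters the original class removes, one quick check is to list its matches with re.findall. This is only a sketch for Python 3, and the variable name removed is purely illustrative:

```python
import re

t = "Meu cão é #paraplégico$."

# Every character matched here is one that re.sub would delete.
removed = re.findall("[^A-Za-z0-9 ]", t)
print(removed)  # ['ã', 'é', '#', 'é', '$', '.']
```

The accented letters show up next to the punctuation because they are simply not inside the listed ranges.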

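You can keep all Unicode word characters by using \w instead. In Python 3, \w is already Unicode-aware for str patterns, so the flag is optional there; in Python 2.7 you need unicode strings plus re.UNICODE. Note that \w also matches the underscore: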

>>> t = "Meu cão é #paraplégico$."
>>> re.sub(r"[^\w ]", "", t, flags=re.UNICODE)
'Meu cão é paraplégico'

Or you could add the characters into your regex like so: [^A-Za-z0-9ãé ].
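
For the Python 2.7 part of the question, a minimal sketch that should behave the same on 2.7 and on 3.3+ is to use unicode literals throughout and keep re.UNICODE. The helper name strip_symbols is hypothetical, and the extra |_ alternative is an assumption that underscores should be dropped too, since \w would otherwise keep them:

```python
import re

# u"" literals are valid on Python 2.7 and on Python 3.3+.
# re.UNICODE is required on Python 2; on Python 3 it is already
# the default for str patterns, so passing it is harmless.
_NOT_WORD_OR_SPACE = re.compile(u"[^\\w ]|_", re.UNICODE)


def strip_symbols(text):
    """Remove everything except Unicode word characters (minus '_') and spaces."""
    return _NOT_WORD_OR_SPACE.sub(u"", text)


# Escapes keep this source ASCII-only, so no coding declaration is needed on 2.7.
cleaned = strip_symbols(u"Meu c\u00e3o \u00e9 #parapl\u00e9gico$.")
print(cleaned)
```

Compiling the pattern once is just a convenience for repeated calls; a plain re.sub call with the same arguments would work as well.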

dda
Yeonho
  • Yep, I got it! But what is the equivalent of A-Za-z in Unicode? – fccoelho Mar 05 '13 at 12:17
  • 2
    In many (other) languages you could use Unicode properties to define a regex of `[^\p{Alpha} ]`. See http://stackoverflow.com/questions/1832893/python-regex-matching-unicode-properties for alternatives in Python. – Joe Mar 05 '13 at 12:40
3
In [1]: import regex
In [2]: t = u"Meu cão é #paraplégico$."
In [3]: print(regex.sub(r"[^\p{Alpha} ]", "", t, flags=regex.UNICODE))
Meu cão é paraplégico

dda
0

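I solved this by switching to the regex library (from PyPI), which supports Unicode property classes such as \p{L} (letters) and \p{N} (numbers).

The call then became the following (Python 2 syntax; Python 3 does not accept the ur prefix, so use a plain r"..." string there):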

regex.sub(ur"[^\p{L}\p{N} ]+", u"", t)
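
If installing a third-party package is not an option, a rough standard-library approximation of [^\p{L}\p{N} ] is to filter on each character's Unicode general category. This is only a sketch for Python 3; the function name keep_letters_numbers is made up for illustration, and it assumes precomposed (NFC) text, since combining marks belong to category M and would be dropped:

```python
import unicodedata


def keep_letters_numbers(text):
    """Keep spaces plus characters whose general category starts with L or N."""
    return "".join(
        ch for ch in text
        if ch == " " or unicodedata.category(ch)[0] in ("L", "N")
    )


print(keep_letters_numbers("Meu cão é #paraplégico$."))
```

A character-by-character filter like this may be slower than a compiled regex on long inputs, so it is best seen as a fallback rather than a replacement.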
fccoelho