What Python regex matches all alphabet characters but no numbers? [unicode aware]

Question

I'm looking for the equivalent of [\w]&&[^\d] (Of course && is not a regex operator). The regex needs to match ONLY words made up of UTF8 "alphabet" characters. Does anyone have any ideas?

http://stackoverflow.com/questions/8923949/matching-only-a-unicode-letter-in-python-re — warvariuc, Apr 03 '12 at 06:11
Are you talking about the English alphabet? Then the answers [a-zA-Z] below will suffice. Otherwise you're in for a treat... — Jonas Byström, Apr 03 '12 at 06:12
"NEVER perform regexs on encoded text." This is for internationalized URL matching, not longform text. — Thomas, Apr 03 '12 at 07:18
@IgnacioVazquez-Abrams "NEVER perform regexs on encoded text." How come there is an re.UNICODE flag then? I guess things break for you when you're not using that flag. — bpj, Jul 09 '16 at 13:32
@bpj: `re.UNICODE` doesn't make `re` work on encoded text, it makes various special sequences match non-ASCII characters. — Ignacio Vazquez-Abrams, Jul 09 '16 at 14:45

score 9 · Accepted Answer · answered Apr 03 '12 at 06:21

The third-party regex module (not the standard-library re) supports Unicode properties, which means that you can use \p{L} with it.
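A minimal sketch of that approach, assuming the third-party regex package is installed (pip install regex). The helper name letter_words is made up for illustration and is not part of the answer:

```python
# Assumes the third-party "regex" package; the standard-library re module
# does not understand \p{...} property escapes.
try:
    import regex
except ImportError:
    regex = None  # package not installed


def letter_words(text):
    """Return every run of Unicode letters (general category L*) in text.

    Hypothetical helper name, used here only for illustration.
    """
    if regex is None:
        raise RuntimeError("the third-party 'regex' package is not installed")
    # \p{L} matches any code point whose general category starts with L,
    # so digits and underscores end a run.
    return regex.findall(r"\p{L}+", text)
```

For example, letter_words("слово 123 あい_x9") should pick out "слово", "あい" and "x", and skip the digits and the underscore.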

Ignacio Vazquez-Abrams

score 1 · Answer 2 · answered Apr 03 '12 at 06:21

As Ignacio pointed out, [a-zA-Z] would not match non-ASCII letters, and re has no predefined character class for all Unicode letters. You may want to use something similar to the following, which is simple and straightforward:

import re, string
re.findall("[" + string.letters + "]+", st)

Please note that string.letters is locale dependent (and exists only in Python 2). Unless you want to switch the locale, which you can of course do with locale.setlocale(locale.LC_CTYPE, code), this should work like a breeze.

score 0 · Answer 3 · answered Apr 03 '12 at 06:20

AFAICT, there isn't a regex that matches all letters but not digits or underscores.

You could use \w and then check to see if the matches are letters using the code point properties:

import unicodedata

def isletter(c):
    return unicodedata.category(c).startswith('L')
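One way to combine the two steps, sketched with the standard library only and assuming Python 3, where \w is Unicode-aware for str patterns. The name letter_only_words is hypothetical:

```python
import re
import unicodedata


def isletter(c):
    # Letter categories (Lu, Ll, Lt, Lm, Lo) all start with 'L'.
    return unicodedata.category(c).startswith('L')


def letter_only_words(text):
    """Return the \\w+ matches in text that consist solely of letters.

    Hypothetical helper name, used here only for illustration.
    """
    candidates = re.findall(r"\w+", text)
    # Drop any candidate containing a digit, an underscore, or another
    # non-letter word character.
    return [w for w in candidates if all(isletter(c) for c in w)]
```

Note that this drops a whole token such as "abc123" instead of extracting the "abc" part. If letter runs inside mixed tokens are wanted, the filtering would have to happen per character.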

Raymond Hettinger

score -1 · Answer 4 · answered Apr 03 '12 at 06:13

Not sure about regex, but for Unicode you might be able to make use of the unicodedata module; specifically the unicodedata.category() function.

Preet Kukreti

score -6 · Answer 5 · answered Apr 03 '12 at 06:09

Use [a-zA-Z] to match all the alphabet characters.

Steven You

Incorrect. This will miss "あ". – Ignacio Vazquez-Abrams Apr 03 '12 at 06:09
That's not an alphabet character. – Steven You Apr 03 '12 at 06:20
Yes it is. It's just not an English alphabet character. – Ignacio Vazquez-Abrams Apr 03 '12 at 06:21
"éléphant" : the 'é' wouldn't match. – Guillaume Lebourgeois Apr 16 '13 at 08:59
