9

I'm looking for the equivalent of [\w]&&[^\d] (Of course && is not a regex operator). The regex needs to match ONLY words made up of UTF8 "alphabet" characters. Does anyone have any ideas?

Thomas
  • 11,757
  • 4
  • 41
  • 57

5 Answers5

9

regex supports Unicode properties, which means that you can use \p{L} with it.

Ignacio Vazquez-Abrams
  • 776,304
  • 153
  • 1,341
  • 1,358
1

As Ignacio pointed out [a-zA-Z] would not match Unicode characters, and there is no character class predefined for all Unicode characters, you may want to use something similar to the following, which would be simple and straightforward

re.findall("(["+string.letters+"])+",st)

Please note, string.letters is locale dependent and unless you want to switch the local, which you can off-course do with locale.setlocale(locale.LC_CTYPE, code), this should work as a breeze.

Abhijit
  • 62,056
  • 18
  • 131
  • 204
0

AFAICT, there isn't a regex that matches all letters but not digits or underscores.

You could use \w and then check to see if the matches are letters using the code point properties:

def isletter(c):
    return unicodedata.category(c).startswith('L')
Raymond Hettinger
  • 216,523
  • 63
  • 388
  • 485
-1

Not sure about regex, but for unicode you might be able to make use of the uncodedata module; specifically the unicodedata.category() function

Preet Kukreti
  • 8,417
  • 28
  • 36
-6

Use [a-zA-Z] to match all the alphabet characters.

Steven You
  • 439
  • 5
  • 13