4

I'm trying to validate name fields with the re module.

By default, \w doesn't match non-ASCII characters such as à.

In many other regex engines the solution would be \p{L}, but Python's re module does not appear to support it. What would be a suitable equivalent?

Update:

This differs from other questions on the topic: I'm looking for a Unicode alternative to \w other than the one obtained with the re.UNICODE flag, since with that flag \w also matches digits and underscores.

j0k
GJ.

4 Answers

1

I believe you need to enable Unicode support for character classes with the re.UNICODE flag.

regexRef = re.compile(r"\w", re.UNICODE)

See if that helps to match those non-ASCII characters.

Jim Black
    re.UNICODE doesn't solve this problem since it also matches digits and underscores. – GJ. Mar 04 '13 at 08:55
1

Does [^\d\s_] match what you want?
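As a rough check of this suggestion, the sketch below compares it against a class built on \W. The sample string is made up for illustration. It suggests that [^\d\s_] excludes digits, whitespace, and underscores but still lets punctuation through, which may be too permissive for name fields.

```python
import re

# Illustrative sample: a letter, a digit, an underscore, a space, and punctuation.
sample = "à1_ -!"

# Everything that is not a digit, whitespace, or underscore.
negated = re.findall(r"[^\d\s_]", sample)

# Word characters minus ASCII digits and underscore.
letters = re.findall(r"[^\W0-9_]", sample)

print(negated)  # ['à', '-', '!']
print(letters)  # ['à']
```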

Peter Graham
1

[^\W0-9_] works for me when used together with re.UNICODE.
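A minimal sketch of how this class might be used to validate a name field. The helper name is_valid_name and the rule that letter runs may be joined by a single space, hyphen, or apostrophe are assumptions for illustration, not part of the answer. In Python 3, str patterns are Unicode-aware by default, so passing re.UNICODE is redundant but harmless.

```python
import re

# Letter class from the answer: any word character except ASCII digits and underscore.
LETTER = r"[^\W0-9_]"

# Assumed rule (hypothetical): runs of letters joined by a single space, hyphen, or apostrophe.
NAME_RE = re.compile(rf"{LETTER}+(?:[ '-]{LETTER}+)*", re.UNICODE)

def is_valid_name(text):
    """Return True if the whole string matches the assumed name pattern."""
    return NAME_RE.fullmatch(text) is not None

print(is_valid_name("José"))    # True
print(is_valid_name("user_1"))  # False
```

Note that 0-9 removes only ASCII digits. Decimal digits from other scripts, such as the Arabic-Indic digit ٣, still match this class; writing [^\W\d_] instead also excludes those.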

GJ.
  • @quetzalcoatl thanks for the reference, this was hiding in a partial form inside it. – GJ. Mar 06 '13 at 08:17
0

Pass Unicode strings to the re module and enable the re.UNICODE flag. For example:

# -*- coding: utf-8 -*-
import re

# In Python 3, str is already Unicode, so the Python 2 ur"" prefix is dropped.
print(re.findall(r"\w+", r"\w does match à.", flags=re.UNICODE))
jfs