Python regex to match non-ascii names

Question

I'm trying to validate name fields with the re module.

\w doesn't match non-ASCII characters such as à.

In many other regex engines the solution would be \p{L}, but Python's re module does not appear to support it. What would be a suitable equivalent?

Update:

This differs from other questions on this topic: I'm looking for a Unicode-aware pattern that matches letters only. \w combined with the re.UNICODE flag is not suitable, because \w also matches digits and underscores.

Are you using the [LOCALE](http://docs.python.org/2/library/re.html#re.LOCALE) and/or [UNICODE](http://docs.python.org/2/library/re.html#re.UNICODE) flags? — BrenBarn, Mar 03 '13 at 19:12
See http://stackoverflow.com/questions/238223/match-unicode-in-plys-regexes for a similar (duplicate?) question. — Michael Scott Asato Cuthbert, Mar 03 '13 at 19:22
@BrenBarn I've tried re.UNICODE but it's not suitable since it also matches digits and underscores — GJ., Mar 04 '13 at 08:56
@GJ.: to your update: the indicated duplicate states "and I also need a regex that does **not match numbers**." Is the underscore-handling the only difference then? — quetzalcoatl, Mar 04 '13 at 09:20
\w matches digits and underscores regardless of the UNICODE flag being set. — Peter Graham, Mar 06 '13 at 03:52

score 1 · Answer 1 · answered Mar 03 '13 at 19:21

I believe you need to enable Unicode support for character classes with the re.UNICODE flag:

import re
regexRef = re.compile(r"\w", re.UNICODE)

See if that helps to match those non-ASCII characters.
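To see why the flag alone may not be enough for this question, here is a minimal check. It assumes Python 3, where str patterns are already Unicode-aware and re.UNICODE is redundant; the sample string is made up for illustration.

```python
import re

sample = "\u00e0 1 _"  # "à", a digit, and an underscore (illustrative input)

# Unicode matching (the default for str patterns in Python 3):
# \w accepts the accented letter, but also the digit and the underscore.
unicode_hits = re.findall(r"\w", sample, re.UNICODE)

# ASCII-only matching, for comparison: the accented letter is dropped.
ascii_hits = re.findall(r"\w", sample, re.ASCII)

print(unicode_hits)  # ['à', '1', '_']
print(ascii_hits)    # ['1', '_']
```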

Jim Black


re.UNICODE doesn't solve this problem since it also matches digits and underscores. – GJ. Mar 04 '13 at 08:55

score 1 · Answer 2 · answered Mar 06 '13 at 03:59

Does [^\d\s_] match what you want?
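One quick way to check, sketched under the assumption of Python 3 and a made-up sample string: this class excludes digits, whitespace, and underscores, but it still accepts punctuation and symbols, so it is broader than "letters only".

```python
import re

sample = "\u00e0-1 _!"  # illustrative input: letter, hyphen, digit, space, underscore, "!"

# Everything that is not a digit, whitespace, or underscore is accepted.
hits = re.findall(r"[^\d\s_]", sample)

print(hits)  # ['à', '-', '!']
```

Whether the extra matches matter depends on the names being validated; a hyphen, for instance, may be wanted in some name fields.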

Peter Graham

score 1 · Accepted Answer · answered Mar 06 '13 at 08:16

[^\W0-9_] works for me when used together with re.UNICODE.
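A minimal sketch of this approach, assuming Python 3 (where str patterns are Unicode-aware without the flag). The helper name is_letters_only is hypothetical, and the variant [^\W\d_] is swapped in for comparison because 0-9 only excludes ASCII digits, while \d in Unicode mode also excludes decimal digits from other scripts.

```python
import re

# "A word character that is neither a digit nor an underscore".
LETTERS = re.compile(r"[^\W\d_]+")

# The class from the answer, which only removes ASCII digits.
LETTERS_ASCII_DIGITS = re.compile(r"[^\W0-9_]+")


def is_letters_only(text):
    """Return True if text is non-empty and every character matches LETTERS."""
    return LETTERS.fullmatch(text) is not None


print(is_letters_only("Zo\u00eb"))  # True  ("Zoë")
print(is_letters_only("abc1"))      # False (contains a digit)
print(is_letters_only("a_b"))       # False (contains an underscore)

# U+0663 is an Arabic-Indic digit: it slips through 0-9 but not through \d.
print(LETTERS_ASCII_DIGITS.fullmatch("\u0663") is not None)  # True
print(LETTERS.fullmatch("\u0663") is not None)               # False
```

Either class only approximates \p{L}. If a third-party dependency is acceptable, the regex package on PyPI supports \p{L} directly.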

GJ.

@quetzalcoatl thanks for the reference, this was hiding in a partial form inside it. – GJ. Mar 06 '13 at 08:17

score 0 · Answer 4 · answered Mar 03 '13 at 19:29

Pass Unicode strings to the re module and enable the re.UNICODE flag, for example:

# -*- coding: utf-8 -*-
import re

print(re.findall(ur"\w+", ur"\w does match à.", flags=re.UNICODE))
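The snippet above uses Python 2 syntax; the ur"" prefix is a syntax error in Python 3. A rough Python 3 equivalent, offered as a sketch rather than as the answerer's own code, could look like this. Plain str literals are already Unicode there, so neither the prefix nor the flag is needed.

```python
import re

# Same text as in the Python 2 example, written as a plain str literal.
text = "\\w does match \u00e0."

# str patterns are Unicode-aware by default in Python 3.
words = re.findall(r"\w+", text)

print(words)  # ['w', 'does', 'match', 'à']
```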

jfs
