How to match umlauts with regular expressions?

Question

In German text, the umlauts (ä, ü, ö) and the eszett (ß) are regular letters, but they don't seem to be covered by the \w character class:

In [1]: re.match('(\w+)', 'Straße').groups()
Out[1]: ('Stra',)

Passing the re.UNICODE flag to re.match doesn't change anything.

Is there any better way to match a full word other than with [a-zA-ZäüöÄÜÖß]+?

I cannot repro: see [`re.match(ur'(\w+)', u'Straße', flags=re.U).group(1).encode("utf8")`](https://ideone.com/R1xTej), it prints `Straße`. Maybe you just missed `u""` prefixes? `\w` covers all Unicode letters in fact when you pass the `re.U` flag. — Wiktor Stribiżew, May 13 '16 at 13:09
@WiktorStribiżew you should post that as answer. that is the answer. I get the same result as @elpres when I use his code. it definitely needs the `u` prefix. — Tom Myddeltyn, May 13 '16 at 13:12
@WiktorStribiżew True, the `u''` prefix does indeed solve the problem. — elpres, May 13 '16 at 13:18

Keozon · Accepted Answer · 2016-05-13T13:38:00.830

7

Since you are using Python 2, you need to use unicode strings:

import re
print re.match(ur'(\w+)', u'Straße', re.UNICODE).groups()[0]
Straße
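
To see why the plain string stops at `Stra`, here is a minimal sketch. It assumes Python 3, where `str` is already Unicode and `bytes` plays roughly the role of the Python 2 `str`, and it assumes the text is encoded as UTF-8. The variable names are illustrative only and do not come from the answer.

```python
import re

word = 'Straße'

# Byte pattern on UTF-8 bytes: roughly analogous to the Python 2 call in the question.
# 'ß' is encoded as the two bytes 0xC3 0x9F, and neither is an ASCII word character.
bytes_match = re.match(rb'(\w+)', word.encode('utf-8')).group(1)

# Text (Unicode) pattern: \w covers Unicode letters, including ß and the umlauts.
text_match = re.match(r'(\w+)', word).group(1)

print(bytes_match)  # b'Stra'
print(text_match)   # Straße
```

On Python 3 no `u''` prefix or `re.UNICODE` flag is needed for text patterns, since Unicode matching is the default for `str`.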

edited May 13 '16 at 13:38

answered May 13 '16 at 13:12

Keozon

You don't need the `u` in `u'(\w+)'` but it doesn't hurt. – Tom Myddeltyn May 13 '16 at 13:14
You're right, it works when there are both a `u''` string and the `re.UNICODE` flag. Thanks! – elpres May 13 '16 at 13:14
A hint: use raw string literals when defining a regex pattern. – Wiktor Stribiżew May 13 '16 at 13:31
Yes, this is true. I was rushing and it didn't happen to matter in this instance. – Keozon May 13 '16 at 13:37
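
A small sketch of why the raw-string hint matters, assuming Python 3. The pattern `(\w+)` happens to survive without the `r` prefix, but an escape such as `\b` does not. The sample text and variable names are made up for illustration.

```python
import re

text = 'die Straße'

# Without the r prefix, '\b' is a backspace character (0x08), not a word boundary.
plain = re.search('\bStraße\b', text)

# With the r prefix, the backslash reaches the regex engine, so \b means "word boundary".
raw = re.search(r'\bStraße\b', text)

print(plain)         # None
print(raw.group(0))  # Straße
```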
