Python regex and Unicode

Question

I am currently trying to figure out how to use Unicode in a regex in Python.

The regex I want to get to work is the following:

r"([A-ZÜÖÄß]+\s)+"

This should match all occurrences of multiple capitalized words, which may or may not contain umlauts. Funnily enough, it does nearly what I want, but it still ignores umlauts.

For example, in FUßBALL AND MORE only BALL AND MORE is detected, although the whole phrase should match.

I already tried using the Unicode escapes (Ü becomes \u00DC, etc.), as advised in another thread, but that does not work either. I might switch to the "regex" library instead of "re", but I would like to know what I am doing wrong right now.

If you are able to enlighten me, please feel free to do so.

Well, that makes sense. Yes, I am using Python version 2.7.12 ----- Cool. That means I don't misunderstand regexes (I feared I had just produced a really stupid regex ;D ) — Junge, Oct 05 '17 at 09:10
Replacing the Chars with their ISO representation worked like a charm. ---> r'(?:[A-Z\xC4\xD6\xDC\xDF]+\s)+' Do you mind posting your comment as an answer? Then I could accept that and close the question. Thank you a lot, by the way! — Junge, Oct 05 '17 at 09:48
I'll look over it as soon as I am back at my desk. I can't upvote you any more. Somebody must have downvoted your stuff, for reasons I suppose... — Junge, Oct 08 '17 at 11:33
Yes. Adding the 'u' seems to work well. I changed the answer status accordingly. — Junge, Oct 09 '17 at 06:44
So, that means it is another duplicate of a very popular question. Closed as such. — Wiktor Stribiżew, Oct 09 '17 at 06:49

Mark Tolonen · Accepted Answer · 2017-10-06T05:32:55.647

Use Unicode strings. Make sure your source is saved in the declared encoding:

#coding:utf8
import re

for s in re.finditer(ur"[A-ZÜÖÄß]+",u"FUßBALL AND MORE"):
    print s.group()

Output:

FUßBALL
AND
MORE

Without Unicode strings, your string literals are byte strings in the encoding of your source file. If that is UTF-8, every non-ASCII character is stored as multiple bytes, so a character class such as [ÜÖÄß] contains the individual bytes rather than the characters. Unicode strings can still cause problems on a narrow Python 2 build, but only for code points above U+FFFF (such as emoji), which are stored as UTF-16 surrogate pairs (two code units each). In that case, switch to the latest Python 3.x, where the problem is solved and every Unicode code point has a length of 1.
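
The difference between bytes and characters can be illustrated with a small sketch. It is written for Python 3, where str is always Unicode, and it emulates Python 2 byte strings by encoding explicitly. The Latin-1 input is only an assumed scenario for illustration; the question does not say how the original text was encoded.

```python
import re

text = "FUßBALL AND MORE"
pattern = "[A-ZÜÖÄß]+"

# Python 3 str is Unicode, so each umlaut is one member of the class.
unicode_words = re.findall(pattern, text)
# -> ['FUßBALL', 'AND', 'MORE']

# In UTF-8, ß is stored as two bytes. A byte-string character class
# therefore contains those two bytes, not the character ß.
eszett_utf8 = "ß".encode("utf-8")
# -> b'\xc3\x9f'

# Assumed mismatch: the pattern bytes are UTF-8, the text bytes are Latin-1.
# Latin-1 stores ß as the single byte 0xDF, which is not in the class,
# so the match is broken at that position.
byte_words = re.findall(pattern.encode("utf-8"), text.encode("latin-1"))
# -> [b'FU', b'BALL', b'AND', b'MORE']

# A code point above U+FFFF has length 1 in Python 3,
# but needs two 16-bit code units (a surrogate pair) in UTF-16.
emoji = "\U0001F600"
emoji_length = len(emoji)
utf16_units = len(emoji.encode("utf-16-le")) // 2
# -> emoji_length is 1, utf16_units is 2
```

Keeping the pattern and the text in the same type (both Unicode) avoids the mismatch entirely, which is what the u prefix achieves in Python 2.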
