I am currently trying to figure out how to use Unicode in a regex in Python.

The regex I want to get to work is the following:

r"([A-ZÜÖÄß]+\s)+"

This should match every run of several capitalized words, whether or not they contain umlauts. Funnily enough, it does nearly what I want, but it still ignores umlauts.

For example, in FUßBALL AND MORE only BALL AND MORE is detected.

I already tried using the Unicode escapes (Ü becomes \u00DC etc.), as advised in another thread, but that does not work either. I might try the "regex" library instead of "re", but I would like to know what I am doing wrong right now.

If you are able to enlighten me, please feel free to do so.

Wiktor Stribiżew
Junge
  • Well, that makes sense. Yes, I am using Python 2.7.12. Cool, so that means I didn't misunderstand regexes (I feared I had just produced a really stupid regex ;D) – Junge Oct 05 '17 at 09:10
  • Replacing the Chars with their ISO representation worked like a charm. ---> r'(?:[A-Z\xC4\xD6\xDC\xDF]+\s)+' Do you mind posting your comment as an answer? Then I could accept that and close the question. Thank you a lot, by the way! – Junge Oct 05 '17 at 09:48
  • I'll look over it as soon as I am back at my desk. I can't upvote you any more. Somebody must have downvoted your stuff, for reasons I suppose... – Junge Oct 08 '17 at 11:33
  • Yes. Adding the 'u' seems to work well. I changed the answer status accordingly. – Junge Oct 09 '17 at 06:44
  • So, that means it is another duplicate of a very popular question. Closed as such. – Wiktor Stribiżew Oct 09 '17 at 06:49

1 Answer

Use Unicode strings. Make sure your source is saved in the declared encoding:

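#coding:utf8
import re

# Python 2: the u prefix makes both the pattern and the subject Unicode strings,
# so Ü, Ö, Ä and ß are each one character instead of several UTF-8 bytes.
for s in re.finditer(ur"[A-ZÜÖÄß]+", u"FUßBALL AND MORE"):
    print s.group()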

Output:

FUßBALL
AND
MORE
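
For comparison, here is a minimal sketch of the same search on Python 3, where every str literal is already Unicode, so the u prefix is not needed. It assumes the file is saved as UTF-8, which is the Python 3 default source encoding. The names CAPS_WORD and find_caps_words are made up for this illustration and do not come from the code above.

```python
import re

# Python 3: str is Unicode, so the character class holds four real characters
# (Ü, Ö, Ä, ß) next to the range A-Z.
CAPS_WORD = re.compile(r"[A-ZÜÖÄß]+")


def find_caps_words(text):
    """Return every run of upper-case letters (including Ü, Ö, Ä, ß) in text."""
    return [match.group() for match in CAPS_WORD.finditer(text)]


print(find_caps_words("FUßBALL AND MORE"))  # ['FUßBALL', 'AND', 'MORE']
```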

Without Unicode strings, your byte strings are in the encoding of your source file. If that is UTF-8, each non-ASCII character takes up more than one byte, so a character class such as [ÜÖÄß] is really a set of individual bytes, not of characters. Unicode strings can still cause problems in a narrow Python 2 build, but only if you use code points above U+FFFF (such as emoji), which are stored as UTF-16 surrogate pairs (two code units). In that case, switch to Python 3.3 or later, where the problem is solved and every Unicode code point has a length of 1.
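
The byte-versus-character difference can be sketched as follows. This is an illustration written for Python 3, where bytes objects stand in for Python 2 byte strings. The word CAFÉ and all variable names are assumptions chosen for the demonstration, not part of the question.

```python
import re

word = "FUßBALL"
word_bytes = word.encode("utf-8")

char_count = len(word)        # counts code points
byte_count = len(word_bytes)  # ß occupies two bytes in UTF-8

# A byte-string character class contains the individual UTF-8 bytes of
# Ü, Ö, Ä and ß. É shares its first byte (0xC3) with them, so the match
# stops in the middle of that character.
byte_pattern = re.compile("[A-ZÜÖÄß]+".encode("utf-8"))
split_match = byte_pattern.findall("CAFÉ".encode("utf-8"))

# On Python 3.3+ a code point above U+FFFF still has length 1.
emoji_length = len("\U0001F600")

print(char_count, byte_count)  # 7 8
print(split_match)             # [b'CAF\xc3']
print(emoji_length)            # 1
```

The middle result shows why a byte pattern only appears to handle umlauts: it accepts stray bytes rather than whole characters.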

Mark Tolonen