
I encounter a strange problem with regular expression tokenization and Unicode strings.

> mystring = "Unicode rägular expressions"
> tokens = re.findall(r'\w+', mystring, re.UNICODE)

This is what I get:

> print tokens
['Unicode', 'r\xc3', 'gular', 'expressions']

This is what I expected:

> print tokens
['Unicode', 'rägular', 'expressions']

What do I have to do to get the expected result?

Update: This question is different from mine (matching unicode characters in python regular expressions), but its answer https://stackoverflow.com/a/5028826/1251687 would have solved my problem, too.

boadescriptor

2 Answers


The string must be unicode, not a byte string:

mystring = u"Unicode rägular expressions"
tokens = re.findall(r'\w+', mystring, re.UNICODE)
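In Python 3 terms (where str is always Unicode text and byte strings are a separate bytes type), the same distinction can be sketched like this; the encoding name here assumes the input bytes are UTF-8:

```python
import re

# UTF-8 encoded bytes: "ä" becomes the two bytes 0xC3 0xA4, and \w
# in a bytes pattern only matches ASCII word characters, so the
# token is split apart at the non-ASCII bytes.
data = "Unicode rägular expressions".encode("utf-8")
byte_tokens = re.findall(rb"\w+", data)
print(byte_tokens)   # [b'Unicode', b'r', b'gular', b'expressions']

# Decoding first gives real Unicode text, and \w then matches "ä".
text_tokens = re.findall(r"\w+", data.decode("utf-8"))
print(text_tokens)   # ['Unicode', 'rägular', 'expressions']
```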
Javier

You have Latin-1 or Windows Codepage 1252 bytes, not Unicode text. Decode your input:

tokens = re.findall(r'\w+', mystring.decode('cp1252'), re.UNICODE)

An encoded byte can mean anything depending on the codec used; it is not a specific Unicode codepoint. For byte strings (type str), \w only matches ASCII characters.
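To illustrate the codec-dependence, here is a Python 3 sketch using the two non-ASCII bytes visible in the question's output; the same byte pair decodes to different text under different codecs:

```python
# The bytes 0xC3 0xA4 from the question's output:
raw = b"r\xc3\xa4gular"

# Under UTF-8 the two bytes form a single character.
print(raw.decode("utf-8"))   # 'rägular'

# Under cp1252 each byte is its own character.
print(raw.decode("cp1252"))  # 'rÃ¤gular'
```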

Martijn Pieters