
I encounter a strange problem with regular expression tokenization and Unicode strings.

> mystring = "Unicode rägular expressions"
> tokens = re.findall(r'\w+', mystring, re.UNICODE)

This is what I get:

> print tokens
['Unicode', 'r\xc3', 'gular', 'expressions']

This is what I expected:

> print tokens
['Unicode', 'rägular', 'expressions']

What do I have to do to get the expected result?

Update: This question is different from mine (matching unicode characters in python regular expressions), but its answer https://stackoverflow.com/a/5028826/1251687 would have solved my problem, too.

boadescriptor

2 Answers


The string must be unicode, not a byte string:

mystring = u"Unicode rägular expressions"
tokens = re.findall(r'\w+', mystring, re.UNICODE)
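In Python 3 terms (where str is always Unicode text and byte strings are a separate bytes type), the same distinction can be sketched like this; the encoding name here assumes the input bytes are UTF-8:

```python
import re

# UTF-8 encoded bytes: "ä" becomes the two bytes 0xC3 0xA4, and \w
# in a bytes pattern only matches ASCII word characters, so the
# token is split apart at the non-ASCII bytes.
data = "Unicode rägular expressions".encode("utf-8")
byte_tokens = re.findall(rb"\w+", data)
print(byte_tokens)   # [b'Unicode', b'r', b'gular', b'expressions']

# Decoding first gives real Unicode text, and \w then matches "ä".
text_tokens = re.findall(r"\w+", data.decode("utf-8"))
print(text_tokens)   # ['Unicode', 'rägular', 'expressions']
```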
Javier

You have Latin-1 or Windows Codepage 1252 bytes, not Unicode text. Decode your input:

tokens = re.findall(r'\w+', mystring.decode('cp1252'), re.UNICODE)

An encoded byte can mean anything depending on the codec used; it is not a specific Unicode codepoint. For byte strings (type str), \w only matches ASCII characters.
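To illustrate the codec-dependence, here is a Python 3 sketch using the two non-ASCII bytes visible in the question's output; the same byte pair decodes to different text under different codecs:

```python
# The bytes 0xC3 0xA4 from the question's output:
raw = b"r\xc3\xa4gular"

# Under UTF-8 the two bytes form a single character.
print(raw.decode("utf-8"))   # 'rägular'

# Under cp1252 each byte is its own character.
print(raw.decode("cp1252"))  # 'rÃ¤gular'
```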

Martijn Pieters