I have run into a strange problem when tokenizing Unicode strings with regular expressions.
> import re
> mystring = "Unicode rägular expressions"
> tokens = re.findall(r'\w+', mystring, re.UNICODE)
This is what I get:
> print tokens
['Unicode', 'r\xc3', 'gular', 'expressions']
This is what I expected:
> print tokens
['Unicode', 'rägular', 'expressions']
What do I have to do to get the expected result?
Update: This question is different from mine: "matching unicode characters in python regular expressions". But its answer, https://stackoverflow.com/a/5028826/1251687, would have solved my problem, too.
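For later readers, here is a minimal sketch of the kind of fix that resolves this. It assumes the cause is that `mystring` is a UTF-8 encoded byte string rather than a unicode object, which is what a plain `"..."` literal gives you in Python 2. The names `raw` and `text` are only illustrative, and the snippet is not quoted from the linked answer:

```python
# -*- coding: utf-8 -*-
import re

# Assumption: the input arrives as UTF-8 encoded bytes, like a plain
# "..." literal in a UTF-8 source file under Python 2.
raw = b"Unicode r\xc3\xa4gular expressions"

# Decode to a unicode string first, so that \w sees the single
# character U+00E4 instead of the two bytes 0xC3 0xA4.
text = raw.decode("utf-8")

tokens = re.findall(r"\w+", text, re.UNICODE)
print(tokens)
```

With the decoded string, the second token should come back as one unit, `rägular`, instead of being split at the first byte of the `ä`. In Python 2, writing the literal as `u"Unicode rägular expressions"` (with a matching source encoding declaration) should have the same effect as the explicit decode.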