Regular expressions don't like international characters

Question

Possible Duplicate:
matching unicode characters in python regular expressions

Using

re.findall(r'\w+', ip)

on Fältskog returns F and ltskog. I tried with both byte strings and unicode strings, but got the same result.

You need to specify the re.LOCALE or re.UNICODE flag. (Use re.LOCALE if you want to depend on the current locale; otherwise, re.UNICODE will match alphanumeric characters in all languages.) – nhahtdh, Sep 22 '12 at 07:01

Sean Vieira · Accepted Answer · 2012-09-22T07:14:07.220

5

You need to set the appropriate flag (in this case re.UNICODE, which tells re what \w should match):

re.findall(r'\w+', ip, re.UNICODE)

# EDIT

Python 2.7.3 (default, Aug  1 2012, 05:16:07) 
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> re.findall(r"\w+", u"Fältskog", re.UNICODE)
[u'F\xe4ltskog']
>>>
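
For readers on Python 3, here is a minimal sketch of the same idea. It assumes Python 3, where `str` patterns are Unicode-aware by default, so `re.UNICODE` is redundant there and `re.ASCII` is the flag that brings back the split from the question. The variable names are illustrative only.

```python
import re

name = "Fältskog"

# In Python 3, str patterns are Unicode-aware by default,
# so \w already matches "ä" without any extra flag.
unicode_words = re.findall(r"\w+", name)

# re.ASCII restricts \w to [a-zA-Z0-9_], which reproduces
# the split described in the question.
ascii_words = re.findall(r"\w+", name, re.ASCII)

print(unicode_words)
print(ascii_words)
```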

edited Sep 22 '12 at 07:14

answered Sep 22 '12 at 07:01

Sean Vieira

This returned `['F\xc3', 'ltskog']`, not as a single word. Please check it. – Jesvin Jose Sep 22 '12 at 07:11
My bad, I gave it a non-unicode string to start with. Accepted this. – Jesvin Jose Sep 22 '12 at 07:13
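
The pitfall in the comments above (matching against a byte string instead of a unicode string) can be sketched as follows. This assumes Python 3 and UTF-8 encoded input; both are assumptions, and the exact split differs from the Python 2 output quoted in the comment. The fix is the same in either version: decode to text before matching.

```python
import re

# Bytes as they might arrive from a file or socket (UTF-8 is assumed here).
raw = "Fältskog".encode("utf-8")

# A bytes pattern only treats ASCII letters, digits and underscore as \w,
# so the two bytes that encode "ä" break the word apart.
byte_words = re.findall(rb"\w+", raw)

# Decoding first lets \w see whole characters instead of raw bytes.
text_words = re.findall(r"\w+", raw.decode("utf-8"))

print(byte_words)
print(text_words)
```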

score 0 · Answer 2 · answered Sep 22 '12 at 07:16

re.findall(r'[åäöÅÄÖ\w]+', ip)

You can also list the extra letters explicitly like this if you want the pattern to be more visual.
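
A rough sketch of what the explicit character class covers and what it misses. It assumes Python 3 and uses `re.ASCII` so that `\w` contributes only ASCII characters, which isolates the effect of the listed letters. The sample inputs are illustrative.

```python
import re

# Explicit Swedish letters plus ASCII-only \w (because of re.ASCII),
# so only the listed non-ASCII letters are accepted.
pattern = re.compile(r"[åäöÅÄÖ\w]+", re.ASCII)

# Works for the letters that were listed by hand.
swedish = pattern.findall("Agnetha Fältskog")

# Letters outside the list (Ł, ó, ź) still split the word.
polish = pattern.findall("Łódź")

print(swedish)
print(polish)
```

The hand-written class is easy to read, but it only covers the alphabet you enumerate, which is why the flag-based approach generalizes better.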

answered Sep 22 '12 at 07:16

Pablo Jomer
