Russian symbols in re (Python)

Question

I read data from a file:

words = re.findall(r'[\w]+',self._from.encode('utf8'),re.U)

If the file contains:

Hi, how are you?

Then the result is:

['Hi', 'how', 'are', 'you']

But if the file contains Russian text (i.e. Cyrillic characters), for example:

Привет, как дела?

In this case the result is:

['\xd0', '\xd1', '\xd0', '\xd0\xb2\xd0\xb5\xd1', '\xd0\xba\xd0', '\xd0\xba', '\xd0', '\xd0\xb5\xd0', '\xd0']

Why? I have already added:

sys.setdefaultencoding('utf-8')

I'm using Python 2.7 on Ubuntu Linux.

Answer:

words = re.findall(r'[\w]+',self._from.decode('utf8'),re.U)
print u" ".join(words)

score 10 · Accepted Answer · edited May 23 '17 at 12:00

To use \w+ to match alphanumeric unicode characters you should pass both a unicode pattern and unicode text to re.findall.

In Python2:

Assuming that you are reading bytes (not text) from the file, you should decode the bytes to obtain a unicode object:
```
uni = 'Привет, как дела?'.decode('utf-8')
```
ur'(?u)\w+' is a raw unicode literal. Even though it is not necessary here, using raw unicode/string literals for regex patterns is generally good practice: it avoids the need to double the backslash in sequences such as \b, which a non-raw literal would interpret as a backspace character.

The regex pattern ur'(?u)\w+' bakes-in the Unicode flag which tells re.findall to make \w dependent on the Unicode character properties database.
```
import re
uni = 'Привет, как дела?'.decode('utf-8')
print(re.findall(ur'(?u)\w+', uni))
```
yields a list containing the 3 unicode "words":
```
[u'\u041f\u0440\u0438\u0432\u0435\u0442',
 u'\u043a\u0430\u043a',
 u'\u0434\u0435\u043b\u0430']
```
In Python3:

The general principle is the same, except that what were unicodes in Python2 are now strs in Python3, and there is no longer any attempt at automatic conversion between the two. So, again assuming that you are reading bytes (not text) from the file, you should decode the bytes to obtain a str, and use a str regex pattern:
```
import re
uni = b'\xd0\x9f\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82, \xd0\xba\xd0\xb0\xd0\xba \xd0\xb4\xd0\xb5\xd0\xbb\xd0\xb0?'.decode('utf-8')
print(re.findall(r'(?u)\w+', uni))
```
yields
```
['Привет', 'как', 'дела']
```
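If you control how the file is opened, a possible shortcut in Python 3 is to let open() do the decoding, so that re only ever sees str. The sketch below makes assumptions about the asker's setup: the temporary file and the name `path` are made up for illustration and are not part of the original code.

```python
import os
import re
import tempfile

# Hypothetical input file, created here only so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".txt", delete=False) as tmp:
    tmp.write("Привет, как дела?")
    path = tmp.name

# Opening in text mode with an explicit encoding yields str, not bytes.
with open(path, encoding="utf-8") as fh:
    data = fh.read()

# For str patterns Python 3 matches \w against Unicode by default,
# so no (?u) / re.U flag is needed here.
words = re.findall(r"\w+", data)
print(words)

os.remove(path)
```

Passing the encoding explicitly avoids depending on the platform's default locale encoding.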

score 4 · Answer 2 · answered Jul 14 '18 at 20:55

My solution:

txt = re.findall(r'[А-я]+', data)

The range А-я covers the upper- and lower-case letters of the Russian alphabet, except Ё and ё, which lie outside it.

Dmitry

score 2 · Answer 3 · answered Apr 12 '21 at 11:02

Consult the Unicode Cyrillic block to define the regex precisely.

Most Russian letters fall into contiguous ranges, but a few (such as Ё and ё) do not:

re.compile('[А-Яа-яЁё]+')

re.fullmatch("[А-Яа-яЁё ]+", "Ёжик в тумане")

You might also want to include Ѣ and ѣ (yat) or other historical letters, depending on your needs.
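To see why Ё and ё have to be listed separately, it may help to look at the code points. A small Python 3 sketch (the variable names are only for illustration):

```python
import re

# А..я is one contiguous run of code points; Ё and ё sit just outside it.
print(hex(ord("А")), hex(ord("я")))   # 0x410 0x44f
print(hex(ord("Ё")), hex(ord("ё")))   # 0x401 0x451

text = "Ёжик в тумане"
narrow = re.findall(r"[А-я]+", text)       # misses the leading Ё
full = re.findall(r"[А-Яа-яЁё]+", text)    # keeps the whole word
print(narrow)
print(full)
```

Unlike \w, an explicit character class like this matches letters only, so digits and underscores are left out.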

score 1 · Answer 4 · answered Mar 16 '13 at 10:59

You are taking a string that is already unicode and encoding it into UTF-8 bytes. If you omit the encoding step, you get:

line = u"Привет, как дела?"
words = re.findall(r'[\w]+',line ,re.U)
# words = [u'\u041f\u0440\u0438\u0432\u0435\u0442', u'\u043a\u0430\u043a', u'\u0434\u0435\u043b\u0430']
print words[0]
# prints Привет

score 0 · Answer 5 · answered Mar 16 '13 at 10:53

If self._from is a unicode string, you should pass it directly to re.findall (with the re.U flag). If it is a UTF-8-encoded str, you need to decode it into a unicode string first. You shouldn't pass non-ASCII str strings to re.
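One way to picture where the fragments in the question come from: when Python 2's re gets a byte str together with re.U, it appears to classify each byte as if it were a character of its own, so every two-byte Cyrillic letter is cut wherever one of its bytes is not a "word" character. The Python 3 sketch below only imitates that by decoding the bytes as Latin-1 (which maps byte N to code point N); it is an emulation under that assumption, not what Python 2 literally executes.

```python
import re

text = "Привет, как дела?"
raw = text.encode("utf-8")

# Emulation: treat every byte as a separate character (Latin-1 view),
# match \w+ on that, then turn the pieces back into bytes.
as_latin1 = raw.decode("latin-1")
fragments = [m.encode("latin-1") for m in re.findall(r"\w+", as_latin1)]
print(fragments)

# Decoding as UTF-8 first gives re real characters to work with.
words = re.findall(r"\w+", raw.decode("utf-8"))
print(words)
```

The first list consists of broken byte fragments like those shown in the question; the second contains the three expected words.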

wRAR

score 0 · Answer 6 · answered Jan 05 '22 at 21:11

A solution for the Russian alphabet that also handles the letter Ё (it is not included in the А-я range):

import re

text = 'Ё-моё Привет! 2121 как дела?'

re.findall(r'[А-яЁё]+', text)
# => ['Ё', 'моё', 'Привет', 'как', 'дела']

mechnicov
