Python - wrong encoding, regexp

Question

I have a text in Polish from which I want to filter out non-Polish letters, but the problem is that the Polish-specific letters disappear as well:

# coding: utf-8
import re

_NOT_LETTERS = re.compile('[^a-ząćęłóńśżź]+')

text = u'dzień dobry i wszystkiego najlepszego życzę'

data = _NOT_LETTERS.sub(' ', text)

print data

and the result is

 dzie dobry i wszystkiego najlepszego ycz

instead of expected

dzień dobry i wszystkiego najlepszego życzę

How can I fix this? I receive the variable text from a third-party library.

The pattern must be a unicode string too: `re.compile(u'[^a-ząćęłóńśżź]+')`, otherwise multibyte characters are seen as separate bytes *(i.e. one byte, one char)*. — Casimir et Hippolyte, May 24 '16 at 22:53
Great, it works. If you want, add an answer and I'll accept it. — Mateo2, May 24 '16 at 22:59

Casimir et Hippolyte · Accepted Answer · 2016-05-24T23:39:29.787

Accented letters are outside the ASCII range and need several bytes when encoded in UTF-8. For example, the character:

U+0144  ń       LATIN SMALL LETTER N WITH ACUTE

is encoded as two bytes: c5 84
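
This is easy to check from Python itself. A minimal sketch, written to run unchanged on Python 2.7 and Python 3; the variable names are only illustrative:

```python
# -*- coding: utf-8 -*-
import unicodedata

# The letter is written as an escape so the source encoding does not matter.
n_acute = u'\u0144'

# Official Unicode name of the code point.
name = unicodedata.name(n_acute)

# UTF-8 encoding of the code point: the two bytes 0xc5 0x84.
encoded = n_acute.encode('utf-8')

print(name)
print(repr(encoded))
```

One unicode character, two bytes: `len(n_acute)` is 1 while `len(encoded)` is 2.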

When you write a string literal without marking it as a unicode string, each single byte is seen as a character: you get the character \xc5 and the character \x84, but not the character ń (U+0144), which isn't recognized.
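
To see what that means for the pattern in the question, the sketch below roughly emulates the byte-by-byte view on any Python version. Decoding the UTF-8 bytes as Latin-1 is an assumption made here for illustration (it maps each byte to exactly one character); it is not code from the question, and the names are hypothetical.

```python
# -*- coding: utf-8 -*-
import re

text = u'dzień dobry i wszystkiego najlepszego życzę'

# The bytes a UTF-8 source file holds for the non-unicode pattern literal.
pattern_bytes = u'[^a-ząćęłóńśżź]+'.encode('utf-8')

# One byte -> one character: the pattern as a byte-wise reading sees it.
pattern_as_seen = pattern_bytes.decode('latin-1')

# U+0144 (ń) is absent from the class; only its two bytes are present.
print(u'\u0144' in pattern_as_seen)  # False
print(u'\xc5' in pattern_as_seen)    # True

# ń, ż and ę are therefore treated as non-letters and replaced.
broken = re.compile(pattern_as_seen).sub(u' ', text)
print(broken)
```

The Polish letters in the subject are stripped because the character class never contained them as whole characters.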

In Python 2.7, you need to mark the literal as a unicode string (with the u prefix); otherwise every multibyte character is seen as a sequence of single bytes. You can test it yourself by writing:

>>> text = u'dzień'
>>> [c for c in text]
[u'd', u'z', u'i', u'e', u'\u0144']

>>> text = 'dzień'
>>> [c for c in text]
['d', 'z', 'i', 'e', '\xc5', '\x84']

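The Polish letters are not matched because your pattern, unlike your subject string, isn't a unicode string: its character class contains the individual UTF-8 bytes instead of the letters themselves. You need to write: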

re.compile(u'[^a-ząćęłóńśżź]+')
