
I have a text in Polish from which I want to filter out every character that is not a Polish letter. The problem is that the Polish-specific letters disappear too:

# coding: utf-8
import re

_NOT_LETTERS = re.compile('[^a-ząćęłóńśżź]+')

text = u'dzień dobry i wszystkiego najlepszego życzę'

data = _NOT_LETTERS.sub(' ', text)

print data

and the result is

 dzie dobry i wszystkiego najlepszego ycz 

instead of expected

dzień dobry i wszystkiego najlepszego życzę

How can I fix this? I receive the variable text from a third-party library.

Mateo2

1 Answer


Accented letters are outside the ASCII range and take several bytes when encoded in UTF-8. For example, the character:

U+0144  ń       LATIN SMALL LETTER N WITH ACUTE

is encoded as two bytes: c5 84
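This can be checked directly. The short sketch below uses only the built-in `encode` method; the variable names are illustrative, and it is written to run on both Python 2.7 and Python 3:

```python
# -*- coding: utf-8 -*-
# One Unicode character: U+0144, LATIN SMALL LETTER N WITH ACUTE.
char = u'\u0144'

# Encoding it as UTF-8 yields two bytes, 0xc5 and 0x84.
encoded = char.encode('utf-8')

print(len(char))     # 1 character
print(len(encoded))  # 2 bytes
print(encoded == b'\xc5\x84')
```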

When you write a string literal without the u prefix, Python 2 treats it as a byte string, so each byte is seen as a separate character: you get the character \xc5 and the character \x84, and the character ń (U+0144) is never recognized.

In Python 2.7 you need to mark your string as a unicode string; otherwise every multibyte character is seen as a sequence of single bytes. You can test this yourself:

>>> text = u'dzień'
>>> [c for c in text]
[u'd', u'z', u'i', u'e', u'\u0144']

>>> text = 'dzień'
>>> [c for c in text]
['d', 'z', 'i', 'e', '\xc5', '\x84']

The characters are not found because your pattern, unlike your subject string, is not a unicode string. You need to write:

re.compile(u'[^a-ząćęłóńśżź]+')
Casimir et Hippolyte