
I need to get rid of Polish characters in a string I got from an XML file. I use .replace(), but in this case it doesn't work. Why? The code:

# -*- coding: utf-8
from prestapyt import PrestaShopWebService
from xml.etree import ElementTree

prestashop = PrestaShopWebService('http://localhost/prestashop/api', 
                              'key')
prestashop.debug = True

name = ElementTree.tostring(prestashop.search('products', options=
{'display': '[name]', 'filter[id]': '[2]'}), encoding='cp852',  
method='text')

print name
print name.replace('ł', 'l')

Output:

Naturalne mydło odświeżające
Naturalne mydło odświeżające

But when I try to replace a non-Polish character, it works fine.

print name
print name.replace('a', 'o')

Result:

Naturalne mydło odświeżające
Noturolne mydło odświeżojące

This also works fine:

name = "Naturalne mydło odświeżające"
print name.replace('ł', 'l')

Any advice?

vex
  • You need to normalize the Unicode form of both strings to the same [normal form](https://en.m.wikipedia.org/wiki/Unicode_equivalence). – Daniel Pryden Sep 16 '17 at 21:37
  • Possible duplicate of [Can somone explain how unicodedata.normalize(form, unistr) work with examples?](https://stackoverflow.com/questions/14682397/can-somone-explain-how-unicodedata-normalizeform-unistr-work-with-examples) – Daniel Pryden Sep 16 '17 at 21:39

2 Answers


If I understand your problem correctly, you can use unidecode:

>>> from unidecode import unidecode
>>> unidecode("Naturalne mydło odświeżające")
'Naturalne mydlo odswiezajace'

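You might have to decode your cp852-encoded string first, with name.decode('cp852'), so that unidecode receives a Unicode string. Decoding it as UTF-8 would fail, because the bytes are not valid UTF-8.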
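If installing unidecode is not an option, the same decode-first idea can be sketched with the standard library alone. This is only an illustration under assumptions: it is written for Python 3, the helper name `strip_polish` and the table `POLISH_TO_ASCII` are hypothetical, and the mapping covers only the Polish letters. A hand-written table is used because `ł` has no Unicode decomposition, so normalizing and dropping combining marks would leave it untouched.

```python
# Python 3 sketch (hypothetical helper, not part of unidecode or prestapyt).
# Map each Polish letter to its closest ASCII letter.
POLISH_TO_ASCII = str.maketrans(
    'ąćęłńóśźżĄĆĘŁŃÓŚŹŻ',
    'acelnoszzACELNOSZZ',
)


def strip_polish(raw_bytes, encoding='cp852'):
    """Decode the bytes first, then replace Polish letters on the text."""
    text = raw_bytes.decode(encoding)
    return text.translate(POLISH_TO_ASCII)


# Simulate the cp852 bytes that the question's tostring() call returns.
name = 'Naturalne mydło odświeżające'.encode('cp852')
print(strip_polish(name))  # Naturalne mydlo odswiezajace
```

The key step is the same as with unidecode: decode with the encoding the bytes were actually written in, and only then work on characters.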

Eric Duminil

You are mixing encodings with your byte strings. Here's a short working example reproducing the issue. I assume you are running in a Windows console that defaults to an encoding of cp852:

#!python2
# coding: utf-8
from xml.etree import ElementTree as et
name_element = et.Element('data')
name_element.text = u'Naturalne mydło odświeżające'
name = et.tostring(name_element,encoding='cp852', method='text')
print name
print name.replace('ł', 'l')

Output (no replacement):

Naturalne mydło odświeżające
Naturalne mydło odświeżające

The reason is that the name string was encoded in cp852, while the byte string constant 'ł' is encoded in the source file's encoding, UTF-8. The two byte sequences differ, so replace() finds nothing to replace:

print repr(name)
print repr('ł')

Output:

'Naturalne myd\x88o od\x98wie\xbeaj\xa5ce'
'\xc5\x82'
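The mismatch can also be reproduced with explicit bytes. The sketch below is written for Python 3, where bytes and text are separate types, and uses `str.encode` to stand in for both the cp852 output of `tostring` and the UTF-8 bytes of the Python 2 literal; the variable names are illustrative only.

```python
# Python 3 sketch: one character, two different byte sequences.
text = 'Naturalne mydło odświeżające'

name_cp852 = text.encode('cp852')    # stands in for the cp852 bytes from tostring()
needle_utf8 = 'ł'.encode('utf-8')    # what the literal 'ł' is in a UTF-8 source file (Python 2)
needle_cp852 = 'ł'.encode('cp852')   # what actually appears inside name_cp852

print(needle_utf8)   # b'\xc5\x82'
print(needle_cp852)  # b'\x88'

# Searching for the UTF-8 bytes finds nothing, so the data is unchanged.
replaced_wrong = name_cp852.replace(needle_utf8, b'l')
print(replaced_wrong == name_cp852)  # True

# Searching for the cp852 byte matches, but working on decoded text is cleaner.
replaced_right = name_cp852.replace(needle_cp852, b'l')
print(replaced_right.decode('cp852'))  # Naturalne mydlo odświeżające
```

Matching the needle's encoding to the data works, but it is fragile; decoding once and replacing on text avoids having to track encodings at every call.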

The best solution is to use Unicode strings:

#!python2
# coding: utf-8
from xml.etree import ElementTree as et
name_element = et.Element('data')
name_element.text = u'Naturalne mydło odświeżające'
name = et.tostring(name_element,encoding='cp852', method='text').decode('cp852')
print name
print name.replace(u'ł', u'l')
print repr(name)
print repr(u'ł')

Output (replacement was made):

Naturalne mydło odświeżające
Naturalne mydlo odświeżające
u'Naturalne myd\u0142o od\u015bwie\u017caj\u0105ce'
u'\u0142'

Note that in Python 3, et.tostring accepts encoding='unicode' and returns a text string directly, and string constants are Unicode by default. The repr() of a string is more readable as well, while ascii() gives the old escaped form. You'll also find that Python 3.6 will print Polish even to consoles not using a Polish code page, so you may not need to replace the characters at all.

#!python3
# coding: utf-8
from xml.etree import ElementTree as et
name_element = et.Element('data')
name_element.text = 'Naturalne mydło odświeżające'
name = et.tostring(name_element,encoding='unicode', method='text')
print(name)
print(name.replace('ł','l'))
print(repr(name),repr('ł'))
print(ascii(name),ascii('ł'))

Output:

Naturalne mydło odświeżające
Naturalne mydlo odświeżające
'Naturalne mydło odświeżające' 'ł'
'Naturalne myd\u0142o od\u015bwie\u017caj\u0105ce' '\u0142'

Mark Tolonen
  • Thanks a lot! The encode/decode thing is still a bit tricky for me, so I guess I will have to study the Unicode HOWTO. I will also consider moving to Python 3.x. – vex Sep 17 '17 at 07:30