
I have a question about Python 2 encoding. I am trying to decode an ASCII string that contains the Unicode escape of a letter into a Unicode string, and then encode it back to Latin-1, but with no success. Here is an illustration:

In[27]: d = u'\u010d'

In[28]: print d.encode('utf-8')

č

In[29]: d1 = '\u010d'

In[30]: d1.decode('ascii').encode('utf-8')

Out[30]: '\\u010d'

I would like to convert '\u010d' to 'č'. Are there any built-in solutions to avoid custom string replacement?

  • Firstly, an ASCII string can never contain accented characters like č, because they are not part of ASCII. Python is strict about that, it doesn't interpret ASCII as "anything which uses one byte per character". Now, if you used Python's `unicode` instead of `str` to store strings, you could actually store that character and perhaps also convert it to the Latin-1 bytewise representation. I'd suggest you update to Python 3 though, as it is better designed concerning different encodings. – Ulrich Eckhardt Mar 21 '16 at 08:50
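A minimal sketch of the comment's point, assuming Python 2: a `unicode` object can hold the character, and it can be encoded to bytes, but Latin-1 has no slot for č, so that particular encoding fails:

# -*- coding: utf-8 -*-
d = u'\u010d'                 # unicode object for U+010D (č)

print d.encode('utf-8')       # works: the two bytes '\xc4\x8d'

try:
    d.encode('latin-1')       # fails: č has no code point in Latin-1
except UnicodeEncodeError as e:
    print 'not representable in Latin-1:', e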

1 Answer


When you do

d1 = '\u010d'

you actually get this string:

In [3]: d1
Out[3]: '\\u010d'

This is because "normal" (non-Unicode, byte) strings in Python 2 don't recognize the \unnnn escape sequence, so the string ends up containing a literal backslash followed by the characters unnnn.
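You can check this directly (a quick illustration, assuming Python 2; the prompt numbers are arbitrary):

In [5]: len('\u010d')     # six characters: \, u, 0, 1, 0, d
Out[5]: 6

In [6]: len(u'\u010d')    # one character, because unicode literals understand \uXXXX
Out[6]: 1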

In order to decode that, you need to use the unicode_escape codec:

In [4]: print d1.decode("unicode_escape").encode('utf-8')
č

But of course you shouldn't use Unicode escape sequences in non-Unicode strings in the first place.
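A minimal sketch of the recommended approach, again assuming Python 2: keep the text in a unicode literal, so \u010d is interpreted as the character č, and encode only when you need bytes:

# -*- coding: utf-8 -*-
d = u'\u010d'                         # the character č, not six separate characters

print d.encode('utf-8')               # č on a UTF-8 terminal
print repr(d.encode('utf-8'))         # '\xc4\x8d'

# Latin-1 (the encoding mentioned in the question) cannot represent č,
# so either pick an encoding that can, or handle the error explicitly:
print repr(d.encode('latin-1', 'replace'))   # '?'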

Tim Pietzcker