I need to decode a "UNICODE" encoded string:

>>> id = u'abcdß'
>>> encoded_id = id.encode('utf-8')
>>> encoded_id
'abcd\xc3\x9f'

The problem I have is that, using Pylons routing, I get the encoded_id variable as a unicode string u'abcd\xc3\x9f' instead of just a regular string 'abcd\xc3\x9f'.

Using Python, how can I decode my encoded_id variable, which is a unicode string?

>>> encoded_id = u'abcd\xc3\x9f'
>>> encoded_id.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/test/vng/lib64/python2.6/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 4-5: ordinal not in range(128)
alloyoussef
  • If possible, you should figure out why you are getting strings from Pylons incorrectly decoded as `latin-1` (or its close relative, `windows-1252`) instead of `utf-8` to begin with. – Mark Tolonen Sep 27 '13 at 23:32

1 Answer

You have UTF-8 encoded data (there is no such thing as UNICODE encoded data).

Encode the unicode value to Latin-1, then decode from UTF-8:

encoded_id.encode('latin1').decode('utf8')

Latin-1 maps the first 256 Unicode code points (U+0000 through U+00FF) one-to-one to bytes, so the encode step recovers the original UTF-8 bytes unchanged.
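
As a quick check of that mapping, here is a minimal sketch that round-trips every byte value through Latin-1 and then reproduces the mangled string. It assumes the value was produced upstream by decoding UTF-8 bytes as Latin-1, which the question does not confirm; the variable names are illustrative only.

```python
# Latin-1 round-trips every possible byte value without loss.
all_bytes = bytearray(range(256))
as_text = all_bytes.decode('latin-1')  # 256 code points, U+0000..U+00FF
assert [ord(c) for c in as_text] == list(range(256))
assert bytearray(as_text.encode('latin-1')) == all_bytes

# Assumed origin of the problem: UTF-8 bytes decoded as Latin-1 upstream.
original = u'abcd\xdf'  # 'abcd' followed by U+00DF (sharp s)
mangled = original.encode('utf-8').decode('latin-1')
assert mangled == u'abcd\xc3\x9f'

# Reversing the two steps restores the original text.
repaired = mangled.encode('latin-1').decode('utf-8')
assert repaired == original
```

The snippet avoids non-ASCII source characters so it should run unchanged on Python 2.6+ and Python 3.3+.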

Demo:

>>> encoded_id = u'abcd\xc3\x9f'
>>> encoded_id.encode('latin1').decode('utf8')
u'abcd\xdf'
>>> print encoded_id.encode('latin1').decode('utf8')
abcdß
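
If the same repair has to be applied to values that may or may not be mangled, it could be wrapped in a small helper. This is only a sketch: `fix_mojibake` is a hypothetical name, not part of Pylons or the standard library, and it assumes that leaving a string untouched is the right fallback when the repair does not apply.

```python
def fix_mojibake(text):
    """Undo a UTF-8-read-as-Latin-1 mix-up; return text unchanged otherwise."""
    try:
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either a code point above U+00FF (cannot be encoded as Latin-1) or
        # bytes that are not valid UTF-8: the string was probably not mangled
        # this way, so leave it alone.
        return text


print(fix_mojibake(u'abcd\xc3\x9f') == u'abcd\xdf')
```

This is a workaround for data that is already mangled; as the comment on the question suggests, fixing the incorrect decoding at its source is preferable where possible.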
Martijn Pieters