
I'm trying to print the following unicode string, but I get a UnicodeDecodeError: 'ascii' codec can't decode byte error. How can I change this code so that it prints the unicode string properly?

>>> from __future__ import unicode_literals
>>> ts='now'
>>> free_form_request='[EXID(이엑스아이디)] 위아래 (UP&DOWN) MV'
>>> nick='me'

>>> print('{ts}: free form request {free_form_request} requested from {nick}'.format(ts=ts,free_form_request=free_form_request.encode('utf-8'),nick=nick))

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xec in position 6: ordinal not in range(128)

Thank you very much in advance!

Preston Connors
  • Try [ignoring the errors](https://docs.python.org/2/library/stdtypes.html#str.encode): `free_form_request.encode('utf-8', errors='ignore')` – Peter Wood May 15 '15 at 13:29
  • @PeterWood This won't work, the problem happens when the string is decoded after being encoded. Note that the string is already unicode due to the `unicode_literals` import. – Thomas Orozco May 15 '15 at 13:39
  • @ThomasOrozco Ah, I missed that. – Peter Wood May 15 '15 at 14:07

1 Answer


Here's what happens when you construct this string:

'{ts}: free form request {free_form_request} requested from {nick}'.format(ts=ts,free_form_request=free_form_request.encode('utf-8'),nick=nick)
  1. free_form_request is encoded into a byte string using utf-8 as the encoding. This works because utf-8 can represent [EXID(이엑스아이디)] 위아래 (UP&DOWN) MV.
  2. However, the format string ('{ts}: free form request {free_form_request} requested from {nick}') is a unicode string, because of the from __future__ import unicode_literals import.
  3. You can't use byte strings as format arguments for a unicode string, so Python attempts to decode the byte string created in step 1 into a unicode string (which would be a valid format argument).
  4. Python attempts the decoding using the default encoding, which is ascii, and fails, because the utf-8 byte string includes byte values that are not valid ascii.
  5. Python raises a UnicodeDecodeError.
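Steps 3 and 4 happen implicitly on Python 2, which makes them easy to miss. As a rough sketch, the same failure can be triggered by hand by decoding the encoded bytes as ascii. The variable names encoded and decode_error are only illustrative, and the u'' prefix stands in for the unicode_literals import:

```python
# -*- coding: utf-8 -*-
# Sketch: perform by hand the decode that Python 2 does implicitly in steps 3-4.

free_form_request = u'[EXID(이엑스아이디)] 위아래 (UP&DOWN) MV'

# Step 1: encoding to UTF-8 succeeds and yields a byte string.
encoded = free_form_request.encode('utf-8')

# Steps 3-4: decoding those bytes as ASCII fails on the first non-ASCII byte.
try:
    encoded.decode('ascii')
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = exc

# Show the captured exception (encoding, offending byte offset, reason).
print(repr(decode_error))
```

The offset reported by the exception should point at the first byte of the first Korean character, which is the same position named in the traceback from the question.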

Note that while the code is obviously doing something wrong here, it would not actually throw an exception on Python 3, which would instead substitute the repr of the byte string (the repr being a unicode string).
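To illustrate the Python 3 behaviour, here is a minimal sketch that assumes it is run under Python 3 (on Python 2 the same line raises the UnicodeDecodeError described above). The shortened string and the name formatted are only for illustration:

```python
# -*- coding: utf-8 -*-
# Sketch, assuming Python 3: formatting a bytes object into a str raises no
# exception, but the result contains the repr of the bytes, not the text.

free_form_request = '위아래'

formatted = 'request {r}'.format(r=free_form_request.encode('utf-8'))

# The Korean text is gone; what is printed is a b'...' literal with escapes.
print(formatted)
```

So the absence of an exception does not mean the output is right: the encode call still has to go.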


To fix your issue, just pass unicode strings to format.

That is, skip step 1, where you encoded free_form_request as a byte string, and keep it as a unicode string by removing .encode(...):

'{ts}: free form request {free_form_request} requested from {nick}'.format(
    ts=ts, 
    free_form_request=free_form_request, 
    nick=nick)
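If whatever consumes the output really needs bytes, one option is to format with unicode strings only and encode the finished message once at the end. The sketch below assumes a hypothetical helper named build_message (not part of the original code) and uses u'' prefixes in place of the unicode_literals import:

```python
# -*- coding: utf-8 -*-
# Sketch: format with unicode strings only, then encode the finished message
# once, at the point where bytes are actually required.

def build_message(ts, free_form_request, nick):
    # Hypothetical helper; all arguments are expected to be unicode strings.
    template = u'{ts}: free form request {free_form_request} requested from {nick}'
    return template.format(ts=ts, free_form_request=free_form_request, nick=nick)

message = build_message(u'now', u'[EXID(이엑스아이디)] 위아래 (UP&DOWN) MV', u'me')

# Encode the whole result, not the individual format arguments.
message_bytes = message.encode('utf-8')
```

Keeping everything unicode until the last step means there is no byte string for Python 2 to decode implicitly.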

Note Padraic Cunningham's answer in the comments as well.

Thomas Orozco
  • I have the block of code from the original question in a larger function. It turns out something I had imported re-writes the print function and can't handle unicode strings properly. Unfortunately, my workaround is encode('ascii', errors='ignore') and drop the unicode characters and just deal with that. – Preston Connors May 15 '15 at 13:41
  • @PrestonConnors Then you should encode the formatted string, not the format arguments. Just do: `print('...'.format(...).encode('utf-8'))` – Thomas Orozco May 15 '15 at 13:42
  • While the original code would not throw an exception on Python 3, Python 3 doesn't implicitly decode the byte string back to Unicode (step 3), so what is printed is the `repr` of the byte string encoding. To correctly work on Python 3 the `encode` would still have to be removed. – Mark Tolonen May 15 '15 at 15:42