Why is my_str.decode('utf-8') still failing?

Question

I believe in the unicode sandwich. I use the unicode sandwich. So why is it that when I run the following on a byte string (py 2.7)...

label = label.decode("utf-8")

I still get an error:

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/celery/app/trace.py", line 385, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/celery/app/trace.py", line 648, in __protected_call__
    return self.run(*args, **kwargs)
  File "/opt/celery/cl/scrapers/tasks.py", line 638, in update_docket_info_iquery
    d = update_docket_metadata(d, report.metadata)
  File "/usr/local/lib/python2.7/site-packages/juriscraper/pacer/case_query.py", line 166, in metadata
    self._get_label_value_pair(bold, True, field_names)
  File "/usr/local/lib/python2.7/site-packages/juriscraper/pacer/docket_report.py", line 233, in _get_label_value_pair

    label = label.decode("utf-8") <---- Shouldn't this work?

  File "/usr/local/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 6: ordinal not in range(128)

And, why is this throwing a UnicodeEncodeError when I'm trying to do a decode on the line that crashes?

I'm confused. Again.

You got a `UnicodeEncodeError` indicating `label` was already a Unicode string. Python 2.7 implicitly encodes it back to a byte string using the default `ascii` codec before trying to decode it to UTF-8, and that implicit encode fails due to non-ASCII characters in the string. This is one of the things Python 3 fixes. — Mark Tolonen, May 27 '20 at 06:28
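
The implicit step described in that comment can be reproduced explicitly. This is only a minimal sketch: the label value is hypothetical (the real one is not shown in the question), chosen so that a non-breaking space sits at position 6 as in the traceback.

```python
# Hypothetical label that is already unicode text, with a non-breaking
# space (U+00A0) at index 6. The real value is not shown in the question.
label = u"Filed:\xa0"

try:
    # Python 2.7 performs this ASCII encode implicitly when .decode() is
    # called on a unicode object; here it is spelled out explicitly.
    label.encode("ascii")
except UnicodeEncodeError as exc:
    message = str(exc)
else:
    message = None
```

The captured message names the `ascii` codec and position 6, matching the last line of the traceback, even though no decode was requested at that point.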

score 0 · Answer 1 · answered May 27 '20 at 06:35

Your log shows the answer:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 6: ordinal not in range(128)

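Note that this is a `UnicodeEncodeError`, not a decode error, which means `label` is already a unicode object rather than a byte string. When you call `.decode()` on a unicode object, Python 2.7 first implicitly encodes it to bytes with the default `ascii` codec, and that step fails on the non-breaking space `u'\xa0'`. The solution is to work entirely in unicode and to call `.decode("utf-8")` only on values that are actually byte strings.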
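
One way to stay in unicode without tripping the implicit encode is to decode only values that are actually byte strings. A minimal sketch, assuming UTF-8 input; `to_text` is a hypothetical helper name, not part of juriscraper:

```python
def to_text(value, encoding="utf-8"):
    """Return value as unicode text, decoding only if it is a byte string.

    Hypothetical helper for illustration. In Python 2.7, `bytes` is an
    alias for `str`, so unicode objects are passed through untouched.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Byte string input (UTF-8 encoded non-breaking space) is decoded.
from_bytes = to_text(b"Filed:\xc2\xa0")

# Unicode input is returned unchanged, so no implicit ASCII encode happens.
from_text = to_text(u"Filed:\xa0")
```

With a guard like this, `label = to_text(label)` should be safe whether the upstream code hands back bytes or unicode.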

This question is a possible duplicate of: UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)
