
I have a Python script and recently noticed that I was hitting encoding errors on certain input; "smart quotes" in particular were causing problems. I'd like advice on how to overcome this. I am using Python 2, so I need to tell my script that I want to encode everything in UTF-8.


I thought doing this was enough:

mystring.encode("utf-8")

and largely it worked, until I came across smart quotes (and there are possibly many other things that will cause problems, which is why I'm posting here). For example:

mystring = "hi"
mystring.encode("utf-8")

output is

'hi'

But for this:

mystring2 = "’"
mystring2.encode("utf-8")

output is

UnicodeDecodeError                        Traceback (most recent call last)
  <ipython-input-21-f563327dcd27> in <module>()
  ----> 1 mystring2.encode("utf-8")
  UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in
  position 0: ordinal not in range(128)

I created a function to handle the JSON input I get (sometimes I get null/None values and sometimes numeric values, although mostly unicode, which is why I have the couple of if statements):

def xstr(s):
    if s is None:
        return ''
    if isinstance(s, basestring):
        return str(s.encode("utf-8"))
    else:
        return str(s)

This has worked quite well (until this smart-quotes issue).
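For comparison, here is a sketch of what such a helper could look like under Python 3, where `str` is already Unicode and only `bytes` values need decoding; the assumption that any incoming bytes are UTF-8 is mine, and the helper name is just carried over from the question:

```python
def xstr(s):
    """Coerce a JSON value (None, number, text, or bytes) to a text string."""
    if s is None:
        return ""
    if isinstance(s, bytes):          # raw bytestring: decode explicitly
        return s.decode("utf-8")      # assumes the bytes really are UTF-8
    return str(s)                     # unicode text, ints, floats, ...

print(repr(xstr(None)))        # ''
print(repr(xstr(42)))          # '42'
print(repr(xstr(b"caf\xc3\xa9")))
```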

The two questions I have are:

  1. Why can't "smart quotes" be encoded in UTF-8, and are there other limitations of UTF-8 or am I completely misinterpreting what I am seeing?

  2. Is the approach I have used (i.e. my custom function) the best way to handle this? I tried using a try/except to catch the cases of smart quotes, but that didn't work.

Calamari
  • Follow-up question, because I have been trying different things: should I be "decoding" THEN "encoding"? i.e. mystring.decode("utf-8").encode("utf-8") – Calamari Nov 02 '18 at 10:02
  • Since this is Python 2, your source strings are bytestrings. But `encode` is for going from unicode *to* bytestrings, so Python automatically tries to *decode* first, using the default ASCII decoder - hence the error. But really if this is a bytestring to start with you shouldn't need to be doing this at all. – Daniel Roseman Nov 02 '18 at 10:08
  • Thanks @DanielRoseman. The problem is that if I don't use `encode` then I receive errors like this: _'ascii' codec can't encode character u'\u2019' in position 190: ordinal not in range(128)_ – Calamari Nov 02 '18 at 10:17
  • Where do you get that? Where is your actual code? – Daniel Roseman Nov 02 '18 at 10:27

1 Answer


Python cannot encode the string because it doesn't know its current encoding. You'll need to write u"’" in Python 2 to tell Python that this is a Unicode string. ("\xe2" happens to be the first byte of the UTF-8 encoding of this character, but Python doesn't know the bytes are UTF-8 because you haven't told it. You could put a -*- coding: utf-8 -*- comment near the top of your file; or unambiguously represent the character as u"\u2019".)
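To illustrate (this snippet runs the same on Python 2 and 3): once the literal is marked as Unicode, encoding to UTF-8 succeeds.

```python
# A Unicode literal: Python knows the code points, so it can encode them.
s = u"\u2019"                  # RIGHT SINGLE QUOTATION MARK, i.e. u"’"
encoded = s.encode("utf-8")
print(repr(encoded))           # the three-byte UTF-8 sequence e2 80 99
```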

Similarly, to convert a string you read from disk, you have to decode it into Unicode first so that you can then encode it as UTF-8.

print(s.decode('iso-8859-1').encode('utf-8'))

Here, of course, 'iso-8859-1' is just a random guess. You have to know the encoding, or risk getting incorrect output.
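For example, a Latin-1 byte for é round-trips like this (the input bytes here are made up for illustration):

```python
data = b"caf\xe9"                     # 0xe9 is 'é' in ISO-8859-1
text = data.decode("iso-8859-1")      # bytes -> unicode
utf8 = text.encode("utf-8")           # unicode -> UTF-8 bytes
print(repr(utf8))                     # 0xe9 becomes the two bytes c3 a9
```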

tripleee