
I have a Python script and recently noticed that I was hitting encoding errors on certain input; "smart quotes" in particular were causing problems. I'd like advice on how to overcome this. I am using Python 2, so I need to tell my script that I want to encode everything in UTF-8.


I thought doing this was enough:

mystring.encode("utf-8")

and largely it worked, until I came across smart quotes (and there are possibly many other things that will cause problems, which is why I'm posting here). For example:

mystring = "hi"
mystring.encode("utf-8")

output is

'hi'

But for this:

mystring2 = "’"
mystring2.encode("utf-8")

output is

UnicodeDecodeError                        Traceback (most recent call last)
  <ipython-input-21-f563327dcd27> in <module>()
  ----> 1 mystring2.encode("utf-8")
  UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in
  position 0: ordinal not in range(128)

I created a function to handle the JSON input I get (sometimes I get null/None values and sometimes numeric values, although mostly unicode, which is why I have the couple of if statements):

def xstr(s):
    if s is None:
        return ''
    if isinstance(s, basestring):
        return str(s.encode("utf-8"))
    else:
        return str(s)

This has worked quite well (until this smart-quotes issue).
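For comparison, here is a sketch of what such a helper could look like under Python 3, where `str` is already Unicode and only `bytes` values need decoding; the assumption that any incoming bytes are UTF-8 is mine, and the helper name is just carried over from the question:

```python
def xstr(s):
    """Coerce a JSON value (None, number, text, or bytes) to a text string."""
    if s is None:
        return ""
    if isinstance(s, bytes):          # raw bytestring: decode explicitly
        return s.decode("utf-8")      # assumes the bytes really are UTF-8
    return str(s)                     # unicode text, ints, floats, ...

print(repr(xstr(None)))        # ''
print(repr(xstr(42)))          # '42'
print(repr(xstr(b"caf\xc3\xa9")))
```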

The two questions I have are:

  1. Why can't "smart quotes" be encoded in UTF-8, and are there other limitations of UTF-8 or am I completely misinterpreting what I am seeing?

  2. Is the approach I have used (i.e. my custom function) the best way to handle this? I tried using a try/except to catch the cases of smart quotes, but that didn't work.

Calamari
  • Follow-up question, because I have been trying different things: should I be "decoding" THEN "encoding"? i.e. mystring.decode("utf-8").encode("utf-8") – Calamari Nov 02 '18 at 10:02
  • Since this is Python 2, your source strings are bytestrings. But `encode` is for going from unicode *to* bytestrings, so Python automatically tries to *decode* first, using the default ASCII decoder - hence the error. But really if this is a bytestring to start with you shouldn't need to be doing this at all. – Daniel Roseman Nov 02 '18 at 10:08
  • Thanks @DanielRoseman. The problem is that if I don't use `encode` then I receive errors like this: _'ascii' codec can't encode character u'\u2019' in position 190: ordinal not in range(128)_ – Calamari Nov 02 '18 at 10:17
  • Where do you get that? Where is your actual code? – Daniel Roseman Nov 02 '18 at 10:27

1 Answer


Python cannot encode the string because it doesn't know its current encoding. You'll need to write u"’" in Python 2 to tell Python that this is a Unicode string. ("\xe2" happens to be the first byte of the UTF-8 encoding of this character, but Python doesn't know the bytes are UTF-8 because you haven't told it. You could put a -*- coding: utf-8 -*- comment near the top of your file; or unambiguously represent the character as u"\u2019".)
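To illustrate (this snippet runs the same on Python 2 and 3): once the literal is marked as Unicode, encoding to UTF-8 succeeds.

```python
# A Unicode literal: Python knows the code points, so it can encode them.
s = u"\u2019"                  # RIGHT SINGLE QUOTATION MARK, i.e. u"’"
encoded = s.encode("utf-8")
print(repr(encoded))           # the three-byte UTF-8 sequence e2 80 99
```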

Similarly, to convert a string you read from disk, you have to decode it into Unicode first so that you can then encode it as UTF-8.

print(s.decode('iso-8859-1').encode('utf-8'))

Here, of course, 'iso-8859-1' is just a random guess. You have to know the encoding, or risk getting incorrect output.
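For example, a Latin-1 byte for é round-trips like this (the input bytes here are made up for illustration):

```python
data = b"caf\xe9"                     # 0xe9 is 'é' in ISO-8859-1
text = data.decode("iso-8859-1")      # bytes -> unicode
utf8 = text.encode("utf-8")           # unicode -> UTF-8 bytes
print(repr(utf8))                     # 0xe9 becomes the two bytes c3 a9
```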

tripleee