I have a Python script and recently noticed that it hits encoding errors on certain input; "smart quotes" in particular cause problems. I'd like advice on how to overcome this. I am using Python 2, so I need to tell my script that I want to encode everything in UTF-8.
I thought doing this was enough:
mystring.encode("utf-8")
This largely worked until I came across smart quotes (there are possibly many other characters that will cause problems, which is why I'm posting here). For example:
mystring = "hi"
mystring.encode("utf-8")
output is
'hi'
But for this:
mystring = "’"
mystring.encode("utf-8")
output is
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-21-f563327dcd27> in <module>()
----> 1 mystring.encode("utf-8")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
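Note that the traceback reports a decode error from the 'ascii' codec even though the call was encode. In Python 2, calling encode on a byte string (str) first decodes it implicitly with the default encoding, which is normally ASCII. The sketch below spells that hidden step out so it can be reproduced on any Python version. It assumes the smart quote reached the script as UTF-8 bytes (0xe2 is the first byte of U+2019 in UTF-8); implicit_encode is a hypothetical helper written for illustration, not a library function.

```python
# U+2019 (right single quotation mark) as UTF-8 bytes. Assumption: this is
# what the Python 2 byte string "’" holds when the source/terminal is UTF-8.
raw = b"\xe2\x80\x99"


def implicit_encode(byte_string):
    """Hypothetical helper mimicking Python 2's str.encode("utf-8"):
    decode with the default codec (assumed to be ASCII), then encode."""
    return byte_string.decode("ascii").encode("utf-8")


# Plain ASCII survives the hidden decode step, so "hi" works.
ascii_result = implicit_encode(b"hi")

# Non-ASCII bytes fail in the decode step, before any encoding happens.
try:
    implicit_encode(raw)
    error = None
except UnicodeDecodeError as exc:
    error = exc
print(error)

# Decoding with the encoding the bytes are really in works, and
# encoding the resulting text back to UTF-8 round-trips cleanly.
text = raw.decode("utf-8")
round_trip = text.encode("utf-8")
```

So the smart quote itself is encodable in UTF-8; the failure comes from treating bytes that are already encoded as if they were ASCII text.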
I created a function to handle the JSON input I get (sometimes I get null/None values and sometimes numeric values, although it is mostly unicode, which is why I have the couple of if statements):
def xstr(s):
    if s is None:
        return ''
    if isinstance(s, basestring):
        return str(s.encode("utf-8"))
    else:
        return str(s)
This has worked quite well (until this smart quotes issue).
The two questions I have are:
1. Why can't "smart quotes" be encoded in UTF-8? Are there other limitations of UTF-8, or am I completely misinterpreting what I am seeing?
2. Is the approach I have used (i.e. using my custom function) the best way to handle this? I tried using a try/except to catch the cases of smart quotes, but that didn't work.
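For comparison, here is a minimal sketch of a variant of xstr that treats text and byte strings separately, so byte strings are never pushed through the implicit ASCII decode. The name to_utf8 and the input_encoding parameter are hypothetical, and the sketch assumes that any byte strings in the input are already UTF-8; it is one possible approach under those assumptions, not necessarily the best one.

```python
import sys

# Pick the text and byte string types for the running interpreter.
if sys.version_info[0] >= 3:
    text_type, bytes_type = str, bytes
else:
    text_type, bytes_type = unicode, str  # Python 2 names


def to_utf8(s, input_encoding="utf-8"):
    """Return s as UTF-8 encoded bytes (hypothetical helper).

    input_encoding is the assumed encoding of incoming byte strings.
    """
    if s is None:
        return b""
    if isinstance(s, text_type):
        # Real text: encoding it is always safe.
        return s.encode("utf-8")
    if isinstance(s, bytes_type):
        # Already bytes: decode explicitly (this also validates them),
        # then re-encode as UTF-8.
        return s.decode(input_encoding).encode("utf-8")
    # Numbers and other objects: convert to text first, then encode.
    return text_type(s).encode("utf-8")


print(repr(to_utf8(u"\u2019")))
print(repr(to_utf8(b"\xe2\x80\x99")))
print(repr(to_utf8(42)))
```

If the bytes can arrive in some other encoding, input_encoding would have to match it, otherwise the explicit decode raises UnicodeDecodeError instead of silently producing wrong output.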