
I have a scenario where I call an API and then, for each record it returns, query a database. The API returns strings, and when I run the database query for the returned items, some of them raise the following error:

Traceback (most recent call last):
  File "TopLevelCategories.py", line 267, in <module>
    cursor.execute(categoryQuery, {'title': startCategory});
  File "/opt/ts/python/2.7/lib/python2.7/site-packages/MySQLdb/cursors.py", line 158, in execute
    query = query % db.literal(args)
  File "/opt/ts/python/2.7/lib/python2.7/site-packages/MySQLdb/connections.py", line 265, in literal
    return self.escape(o, self.encoders)
  File "/opt/ts/python/2.7/lib/python2.7/site-packages/MySQLdb/connections.py", line 203, in unicode_literal
    return db.literal(u.encode(unicode_literal.charset))
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2013' in position 3: ordinal not in range(256)

The segment of my code that the error refers to is:

         ...    
         for startCategory in value[0]:
            categoryResults = []
            try:
                categoryRow = ""
                baseCategoryTree[startCategory] = []
                #print categoryQuery % {'title': startCategory}; 
                cursor.execute(categoryQuery, {'title': startCategory}) #unicode issue
                done = False
                cont...

After some searching, I tried the following on the command line to understand what is going on:

>>> import sys
>>> u'\u2013'.encode('iso-8859-1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2013' in position 0: ordinal not in range(256)
>>> u'\u2013'.encode('cp1252')
'\x96'
>>> '\u2013'.encode('cp1252')
'\\u2013'
>>> u'\u2013'.encode('cp1252')
'\x96'

But I am not sure how to solve this issue. I also don't understand the theory behind encode('cp1252'); an explanation of what I tried above would be great.

add-semi-colons
    Possible duplicate of [UnicodeEncodeError: 'latin-1' codec can't encode character](http://stackoverflow.com/questions/3942888/unicodeencodeerror-latin-1-codec-cant-encode-character) – ivan_pozdeev Jan 23 '17 at 23:21

3 Answers


If you need Latin-1 encoding, you have several options to get rid of the en-dash or other code points above 255 (characters not included in Latin-1):

>>> u = u'hello\u2013world'
>>> u.encode('latin-1', 'replace')    # replace it with a question mark
'hello?world'
>>> u.encode('latin-1', 'ignore')     # ignore it
'helloworld'

Or do your own custom replacements:

>>> u.replace(u'\u2013', '-').encode('latin-1')
'hello-world'

If you aren't required to output Latin-1, then UTF-8 is a common and preferred choice. It is recommended by the W3C and nicely encodes all Unicode code points:

>>> u.encode('utf-8')
'hello\xe2\x80\x93world'
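
The replacement approach above can be wrapped in a small helper. This is only a sketch: the name `to_latin1` and the `REPLACEMENTS` table are hypothetical (not part of the original answer), and it is written in Python 3 syntax, where encode returns bytes.

```python
# Hypothetical mapping of common "smart" punctuation to ASCII equivalents.
# Extend or shrink it to match the data your API actually returns.
REPLACEMENTS = {
    u"\u2013": u"-",   # en dash
    u"\u2014": u"-",   # em dash
    u"\u2018": u"'",   # left single quotation mark
    u"\u2019": u"'",   # right single quotation mark
    u"\u201c": u'"',   # left double quotation mark
    u"\u201d": u'"',   # right double quotation mark
}


def to_latin1(text, replacements=REPLACEMENTS, errors="replace"):
    """Apply custom replacements, then encode to Latin-1.

    Anything still outside Latin-1 is handled by the `errors` policy
    ("replace" gives '?', "ignore" drops the character).
    """
    for src, dst in replacements.items():
        text = text.replace(src, dst)
    return text.encode("latin-1", errors)
```

Calling `to_latin1(u'hello\u2013world')` yields the bytes for `hello-world`; characters that are not in the table, such as the euro sign, fall back to the chosen error policy.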
Raymond Hettinger
  • This solves my problem. To add to this answer: use decode as well, because encode converts the text to bytes. I used this: text.encode('latin-1', 'ignore').decode('latin-1') – Akshay Jan 23 '23 at 21:47

The Unicode character u'\u2013' is the "en dash". It is contained in the Windows-1252 (cp1252) character set (encoded as the byte 0x96), but not in the Latin-1 (iso-8859-1) character set. Windows-1252 defines additional printable characters in the range 0x80 - 0x9f, among them the en dash.

The solution would be for you to choose a different target character set than Latin-1, such as Windows-1252 or UTF-8, or to replace the en dash with a simple "-".
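
To see the difference between the character sets directly, a quick check along these lines may help. The function name `encodable_in` is made up for illustration, and the code uses Python 3 syntax.

```python
def encodable_in(ch, codecs=("latin-1", "cp1252", "utf-8")):
    """Return {codec: encoded bytes, or None if the codec cannot encode ch}."""
    result = {}
    for name in codecs:
        try:
            result[name] = ch.encode(name)
        except UnicodeEncodeError:
            # The character has no representation in this character set.
            result[name] = None
    return result


en_dash = encodable_in(u"\u2013")
# Latin-1 fails, cp1252 uses the single byte 0x96, UTF-8 uses three bytes.
```

Here `en_dash["latin-1"]` is `None`, while `en_dash["cp1252"]` is the single byte `\x96` seen in the question's interactive session.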

Cito

u.encode('utf-8') converts the string to bytes, which can then be written to stdout with sys.stdout.buffer.write(bytes). Also check out displayhook at https://docs.python.org/3/library/sys.html
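
A minimal sketch of that idea, assuming Python 3 (where sys.stdout.buffer exists). The helper name `write_utf8` and its `stream` parameter are hypothetical, added so the example can be tried against an in-memory stream.

```python
import io
import sys


def write_utf8(text, stream=None):
    """Encode text as UTF-8 and write the bytes to a binary stream.

    Defaults to sys.stdout.buffer, the binary layer under sys.stdout
    in Python 3. Returns the number of encoded bytes (without newline).
    """
    if stream is None:
        stream = sys.stdout.buffer
    data = text.encode("utf-8")
    stream.write(data + b"\n")
    return len(data)


# Demonstration with an in-memory stream instead of the real stdout.
demo = io.BytesIO()
write_utf8(u"hello\u2013world", demo)
```

Writing bytes this way bypasses the encoding configured for sys.stdout, so the terminal still has to interpret the output as UTF-8 for the en dash to display correctly.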

Vaibhav Mule
PriyankaP