
I am attempting to export some data from a database to a CSV file so it can be moved to a cloud data warehouse. However, the script always quits after 36,599 rows and gives me

UnicodeEncodeError: 'charmap' codec can't encode character '\x92' in position 62: character maps to <undefined>

I found that the string causing the issue is 'Rejected-Case No. doesn’t match' and I assume it is a problem with the apostrophe. I do not know why it is causing this issue and have been unable to find a way around it. Does anyone know how to solve this? The code I use is:

import csv
import time

import pymysql

db = pymysql.connect(host='host', port=3306, user="user", passwd="secret",
                     db="db", autocommit=True)
cur = db.cursor()
#cur.execute("call inv1_view_prod.`Email_agg`")

cur.execute("""select fields from table""")

emails = cur.fetchall()
# raw string so '\f' and '\t' in the path are not read as escape sequences
with open(r'O:\file\path\to\File_name.csv', 'w') as fileout:
    writer = csv.writer(fileout)
    writer.writerows(emails)
time.sleep(1)
A. Pearson

1 Answer

As you have not shown the code that causes the error, I am just guessing.

The only fact is that the string 'Rejected-Case No. doesn’t match' contains a "’", which is the unicode character U+2019, RIGHT SINGLE QUOTATION MARK. In the Windows cp1252 code page, this character indeed has the code 0x92.

It looks like somewhere you have a byte string encoded in the cp1252 charset that was not correctly decoded to a unicode string.

What should be done:

There are solutions. Unfortunately, they depend on the Python version you are using (2 or 3), and without seeing more of the code I can only give generic advice:

  • identify the input charset (what the database gives to the Python script)
  • identify the output charset (what you want to write to the CSV file)
  • use explicit conversions so you can pass the correct charsets
  • optionally use errors='replace' in the encoding/decoding calls to avoid UnicodeError exceptions.

Since you use Python 3, I will assume the problem is in decoding unicode from the database. The RIGHT SINGLE QUOTATION MARK has the unicode code point U+2019, but in the string given to Python it is coded as '\x92', which is its cp1252 byte encoding. A quick and dirty fix is to force an encoding/decoding pass to get a correct unicode string. Your code could become:

import csv

import pymysql

db = pymysql.connect(host='host', port=3306, user="user", passwd="secret",
                     db="db", autocommit=True)
cur = db.cursor()
#cur.execute("call inv1_view_prod.`Email_agg`")

cur.execute("""select fields from table""")

charset = 'cp1252'   # or 'utf8' depending on what you want in the csv file
# raw string so '\f' and '\t' in the path are not read as escape sequences
with open(r'O:\file\path\to\File_name.csv', 'w', encoding=charset,
          errors='replace', newline='') as fileout:
    writer = csv.writer(fileout)
    for row in cur.fetchall():
        writer.writerow([field.encode('latin1').decode('cp1252', errors='replace')
                         for field in row])

The encode('latin1').decode('cp1252') pass is just a trick to fix a Python 3 string whose characters carry the codes of a byte encoding. It works because the latin1 encoding is a no-op for all codes below 256.
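To see the trick in isolation, here is a minimal sketch using the exact string from the question:

import csv  # not needed for the demo itself, shown for context

# The mojibake string as Python 3 would have received it: the character
# U+0092 sits where the apostrophe U+2019 should be.
broken = 'Rejected-Case No. doesn\x92t match'

# encode('latin1') recovers the original bytes (a no-op for codes < 256),
# then decode('cp1252') interprets those bytes correctly.
fixed = broken.encode('latin1').decode('cp1252', errors='replace')

print(fixed)  # Rejected-Case No. doesn’t match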

The errors='replace' option asks Python never to raise a UnicodeError exception, but instead to replace the offending character with '?' when encoding to a byte string, or with the official unicode REPLACEMENT CHARACTER U+FFFD '�' when decoding to a unicode string.
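A quick illustration of errors='replace' in both directions (ascii here is just an example target charset, not something from the question):

# Encoding: U+2019 has no ascii equivalent, so it is replaced by '?'.
print('doesn\u2019t'.encode('ascii', errors='replace'))  # b'doesn?t'

# Decoding: the byte 0x81 is undefined in cp1252, so it becomes U+FFFD.
print(b'caf\x81'.decode('cp1252', errors='replace'))     # caf�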


It would probably be cleaner to use the charset option of pymysql.connect, so that the driver decodes everything correctly in the first place. Unfortunately, I have never used MySQL databases from Python...

Serge Ballesta