
I am attempting to export some data from a database to a CSV file so it can be moved to a cloud data warehouse. However, the script always quits after 36,599 rows and gives me

UnicodeEncodeError: 'charmap' codec can't encode character '\x92' in position 62: character maps to <undefined>

I found that the string causing the issue is 'Rejected-Case No. doesn’t match' and I assume it is a problem with the apostrophe. I do not know why it is causing this issue and have been unable to find a way around it. Does anyone know how to solve this? The code I use is:

import csv
import time

import pymysql

db = pymysql.connect(host='host', port=3306, user="user", passwd="secret",
                     db="db", autocommit=True)
cur = db.cursor()
#cur.execute("call inv1_view_prod.`Email_agg`")

cur.execute("""select fields from table""")

emails = cur.fetchall()
# raw string so '\f' and '\t' in the path are not read as escape sequences
with open(r'O:\file\path\to\File_name.csv', 'w') as fileout:
    writer = csv.writer(fileout)
    writer.writerows(emails)
time.sleep(1)
A. Pearson

1 Answer

As you have not shown the code that causes the error, I am just guessing.

The only fact is that the string 'Rejected-Case No. doesn’t match' contains a "’", which is the unicode character U+2019, RIGHT SINGLE QUOTATION MARK. In the Windows cp1252 code page, this character indeed has the code 0x92.

It looks like somewhere you have a byte string encoded in the cp1252 charset that was not correctly decoded to a unicode string.

What should be done:

There are solutions. Unfortunately, they depend on the Python version you are using (2 or 3), and without seeing more of the code I can only give generic advice:

  • identify the input charset (what the database gives to the Python script)
  • identify the output charset (what you want to write to the CSV file)
  • use explicit conversions so you can pass the correct charsets
  • optionally use errors='replace' in the encoding/decoding calls to avoid UnicodeError exceptions.

Since you use Python 3, I will assume the problem is in decoding unicode from the database. The RIGHT SINGLE QUOTATION MARK has the unicode code point U+2019, but in the string given to Python it is coded as '\x92', which is its cp1252 byte encoding. A quick and dirty fix is to force an encoding/decoding pass to get a correct unicode string. Your code could become:

import csv

import pymysql

db = pymysql.connect(host='host', port=3306, user="user", passwd="secret",
                     db="db", autocommit=True)
cur = db.cursor()
#cur.execute("call inv1_view_prod.`Email_agg`")

cur.execute("""select fields from table""")

charset = 'cp1252'   # or 'utf8' depending on what you want in the csv file
# raw string so '\f' and '\t' in the path are not read as escape sequences
with open(r'O:\file\path\to\File_name.csv', 'w', encoding=charset,
          errors='replace', newline='') as fileout:
    writer = csv.writer(fileout)
    for row in cur.fetchall():
        writer.writerow([field.encode('latin1').decode('cp1252', errors='replace')
                         for field in row])

The encode('latin1').decode('cp1252') pass is just a trick to fix a Python 3 string whose characters carry the codes of a byte encoding. It works because the latin1 encoding is a no-op for all codes below 256.
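To see the trick in isolation, here is a minimal sketch using the exact string from the question:

import csv  # not needed for the demo itself, shown for context

# The mojibake string as Python 3 would have received it: the character
# U+0092 sits where the apostrophe U+2019 should be.
broken = 'Rejected-Case No. doesn\x92t match'

# encode('latin1') recovers the original bytes (a no-op for codes < 256),
# then decode('cp1252') interprets those bytes correctly.
fixed = broken.encode('latin1').decode('cp1252', errors='replace')

print(fixed)  # Rejected-Case No. doesn’t match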

The errors='replace' option asks Python never to raise a UnicodeError exception, but instead to replace the offending character with '?' when encoding to a byte string, or with the official unicode REPLACEMENT CHARACTER U+FFFD '�' when decoding to a unicode string.
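A quick illustration of errors='replace' in both directions (ascii here is just an example target charset, not something from the question):

# Encoding: U+2019 has no ascii equivalent, so it is replaced by '?'.
print('doesn\u2019t'.encode('ascii', errors='replace'))  # b'doesn?t'

# Decoding: the byte 0x81 is undefined in cp1252, so it becomes U+FFFD.
print(b'caf\x81'.decode('cp1252', errors='replace'))     # caf�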


It would probably be cleaner to use the charset option of pymysql.connect, so that the driver decodes everything correctly in the first place. Unfortunately, I have never used MySQL databases from Python...

Serge Ballesta