Python MySQL UTF-8 encoding differs depending on order of execution

Question

I recently inherited a python project and I've got some behavior I'm struggling to account for.

The code has two sections, it can import a file into the database, or it can dump the database to an output file. The import looks something like this:

def importStuff(self):
    mysqlimport_args = ['mysqlimport', '--host='+self.host, '--user='+self.username, '--password='+self.password, '--fields-terminated-by=|', '--lines-terminated-by=\n', '--replace', '--local', self.database, filename, '-v']
    output = check_output(mysqlimport_args)

The dump looks like this:

def getStuff(self):
    db = MySQLdb.connect(self.host, self.username, self.password, self.database)
    cursor = db.cursor()
    sql = 'SELECT somestuff'
    cursor.execute(sql)
    records = cursor.fetchall()
    cursor.close()
    db.close()
    return records

def toCsv(self, records, csvfile):
    f = open(csvfile, 'wb')
    writer = csv.writer(f, quoting=csv.QUOTE_ALL)
    writer.writerow(['StuffId'])
    count = 1
    for record in records:
        writer.writerow([record[0]])

    f.close()

Okay not the prettiest python you'll ever see (style comments welcome as I'd love to learn more) but it seems reasonable.

But, I got a complaint from a consumer that my output wasn't in UTF-8 (the mysql table is using utf8 encoding by the way). Here's where I get lost, if the program executes like this:

importStuff(...)

getStuff(...)

toCsv(...)

Then the output file doesn't appear to be valid utf-8. When I break the execution into two different steps

importStuff(...)

then in another file

getStuff(...)

toCsv(...)

Suddenly my output appears as valid utf-8. Aside from the fact that I have a work around, I can't seem to explain this behavior. Can anyone shed some light on what I'm doing wrong here? Or is there more information I can provide that might clarify what's going on?

Thanks.

(python 2.7 in case that factors in)

EDIT: More code as requested. I've made some minor tweaks to protect the innocent such as my company, but it's more or less here:

def main():

    dbutil = DbUtil(config.DB_HOST, config.DB_DATABASE, config.DB_USERNAME, config.DB_PASSWORD)
    if(args.import):
        logger.info('Option: --import')

        try:
            dbutil.mysqlimport(AcConfig.DB_FUND_TABLE)
        except Exception, e:
            logger.warn("Error occured at mysqlimport. Error is %s" % (e.message))

    if(args.db2csv):
        try:
            logger.info('Option: --db2csv')
            records = dbutil.getStuff()
            fileutil.toCsv(records, csvfile)
        except Exception, e:
            logger.warn("Error Occured at db2csv. Message:%s" %(e.message))

main()

And that's about it. It's really short which is making this much less obvious.

The output I'm not sure how to faithfully represent, it looks something like this:

"F0NR006F8F"

They all look like more or less ASCII characters to me, so I'm not sure what problem they could be creating. Maybe I'm approaching this from the wrong angle, I'm currently relying on my text editor's best guess for what encoding a file is in. I'm not sure how I could best detect which character is causing it to stop reading my file as utf-8.

Can you provide an example of two different pieces of code, along with the two different results they produce? — BrenBarn, Nov 12 '12 at 20:59
I think there is some other wrinkle you're not pinning down. The difference your are noticing should not cause the encoding to change. — Steven Rumbalski, Nov 12 '12 at 21:04
Possible duplicate of: [Writing UTF-8 String to MySQL with Python](http://stackoverflow.com/questions/6202726/writing-utf-8-string-to-mysql-with-python). I believe you need: `MySQLdb.connect(use_unicode=True, charset='utf8')` arguments in your `MySQLdb.connect` call (`use_unicode=True` is not strictly necessary and is implicit if `charset='utf8'`). — Pedro Romano, Nov 12 '12 at 21:13
@PedroRomano I tried adding that flag but I have the same problem, maybe I need to come at this from a different direction. Is there a way I can try to determine the exact character that my editor is telling me isn't utf-8? — Dio, Nov 12 '12 at 21:34
@PedroRomano I deleted the exception comment, it was throwing here: writer.writerow([record[0]]) because I tried changing it to writer.writerow([unicode(record[0],"utf-8")]) which was not the right solution and then I forgot to revert it, my apologies. — Dio, Nov 12 '12 at 21:35
@BrenBarn I tried to add it above in the edit, there's disturbingly little going on... I agree with both you and Steven Rumbalski, but I'm not sure what else I'm not capturing... — Dio, Nov 12 '12 at 21:40
@Dio: This is just a guess. `getStuff()` may be returning `unicode` objects. The Python2 `csv` module [does not support unicode input](http://docs.python.org/2/library/csv.html) (see the Note). Therefore, you have to encode the row values yourself. If my guess is correct, then try `writer.writerow([record[0].encode('utf-8')])`. — unutbu, Nov 12 '12 at 21:44

score 0 · Answer 1 · answered Nov 12 '12 at 22:23

Dumbest answer of all time. The input data wasn't in UTF-8. Someone solved this by writing another sproc that would be called periodically to convert the non-utf-8 characters to utf-8. In the time it took me to break my code into two files and run them separately, the job ran. It just happened to run that way the 4-5 times I tried it leading to a false conclusion on my part. I'm now changing the read process to accommodate a non-utf-8 input source so I don't have a weird race condition hiding in the system. Sorry to have lead you all on this goosechase.

Python MySQL UTF-8 encoding differs depending on order of execution

1 Answers1