
I'm writing a Python script that reads tweets and inserts them into MySQL. Depending on the attributes of each tweet, I need to insert different fields. For that reason, I'm building the fields and values section of the query string as I go, using Python string formatting for convenience:

values = """%s, %s, '%s','%s','%s','%s',%s,'%s','%s','%s'""" % (
                url_id, tweet['from_user_id'], conn.escape_string(tweet['location']),
                conn.escape_string(tweet['profile_image_url']),
                tweet['created_at'], tweet['from_user'], tweet['id'],
                conn.escape_string(tweet['text']),
                conn.escape_string(tweet['iso_language_code']), conn.escape_string(tweet['source'])
            )

When I do this with tweets that contain non-ASCII characters, though, I get an error like this:

values = """%s, %s, '%s','%s','%s','%s',%s,'%s','%s','%s'""" % (
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 117: ordinal not in range(128)

I think that the format string (the one with all the "%s"s) is interpreted as ASCII by default, and that's clashing with the UTF-8 characters. I need to keep everything in UTF-8, since this code has to work with any possible language.

So how do I specify that the formatting string is UTF-8? I thought I could change the default encoding for the entire script, but I'm using Python 2.4 and sys.setdefaultencoding doesn't exist in that version. Right now, I'm just not sure how to do that, or if that's even the right thing to do.

Dave Shepard

1 Answer


Change:

"""%s, %s, '%s','%s','%s','%s',%s,'%s','%s','%s'"""

to:

u"""%s, %s, '%s','%s','%s','%s',%s,'%s','%s','%s'"""

And then if you want to encode it to UTF-8, do:

values.encode('utf8')
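As a small illustration of those two steps, here is a sketch with made-up sample data (url_id and text are hypothetical stand-ins for the tweet fields). It assumes every argument is either a number or a unicode string; a byte string containing non-ASCII bytes mixed into the arguments would still have to be decoded first.

```python
# Hypothetical sample values standing in for the tweet fields in the question.
url_id = 7
text = u"it\u2019s a test"  # contains U+2019, a non-ASCII character

# A unicode format string combined with unicode arguments gives a unicode result.
values = u"""%s, '%s'""" % (url_id, text)

# Encode to UTF-8 bytes only at the boundary, e.g. just before sending the query.
encoded = values.encode('utf8')
```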

But it looks like you're using the wrong approach anyway: rather than escaping each value and formatting it into the query string yourself, see the question "Escape string Python for MySQL".
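One common alternative to formatting values into the SQL yourself is to pass them as query parameters and let the driver handle quoting. The sketch below is only an illustration under stated assumptions: it uses the standard-library sqlite3 module as a stand-in so it runs without a MySQL server, and the table, columns, and sample tweet are hypothetical. With MySQLdb the placeholder would be %s instead of ?.

```python
import sqlite3  # stand-in for MySQLdb so the sketch runs without a server

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Hypothetical table; the real schema is not shown in the question.
cur.execute("CREATE TABLE tweets (url_id INTEGER, from_user TEXT, text TEXT, location TEXT)")

# Hypothetical tweet; 'location' is optional and absent here.
tweet = {"from_user": u"dave", "text": u"it\u2019s a test"}

# Build the column list as you go, but keep the values in a separate list.
fields = ["url_id", "from_user", "text"]
params = [7, tweet["from_user"], tweet["text"]]
if "location" in tweet:
    fields.append("location")
    params.append(tweet["location"])

# One placeholder per value; sqlite3 uses "?", MySQLdb uses "%s".
# Only the hard-coded column names are formatted into the SQL text.
placeholders = ", ".join(["?"] * len(params))
sql = "INSERT INTO tweets (%s) VALUES (%s)" % (", ".join(fields), placeholders)
cur.execute(sql, params)
conn.commit()

row = cur.execute("SELECT text, location FROM tweets").fetchone()
```

Because the values never pass through string formatting, the non-ASCII text needs no manual escaping or quoting in this sketch.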

bradley.ayers
  • Thanks! Actually, I tried that earlier, and I still get the same error. It still tries to encode everything in ASCII. – Dave Shepard Jun 15 '11 at 03:20
  • No -- good point. Now it's values = u"""%s, %s, '%s','%s','%s','%s',%s,'%s','%s','%s'""" % ( UnicodeEncodeError: 'ascii' codec can't encode characters in position 81-82: ordinal not in range(128) – Dave Shepard Jun 15 '11 at 03:25
  • That worked! Thank you so much. I knew I was doing something basic that was wrong. – Dave Shepard Jun 15 '11 at 03:41