3

I am trying to write data in a StringIO object using Python and then ultimately load this data into a postgres database using psycopg2's copy_from() function.

First when I did this, the copy_from() was throwing an error: ERROR: invalid byte sequence for encoding "UTF8": 0xc92 So I followed this question.

I figured out that my Postgres database has UTF8 encoding.

The file/StringIO object I am writing my data into shows its encoding as the following: setgid Non-ISO extended-ASCII English text, with very long lines, with CRLF line terminators

I tried to encode every string that I am writing to the intermediate file/StringIO object into UTF8 format. To do this used .encode(encoding='UTF-8',errors='strict')) for every string.

This is the error I got now: UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 47: ordinal not in range(128)

What does it mean? How do I fix it?

EDIT: I am using Python 2.7 Some pieces of my code:

I read from a MySQL database that has data encoded in UTF-8 as per MySQL Workbench. This is a few lines code for writing my data (that's obtained from MySQL db) to StringIO object:

# Populate the table_data variable with rows delimited by \n and columns delimited by \t
row_num=0
for row in cursor.fetchall() :

    # Separate rows in a table by new line delimiter
    if(row_num!=0):
        table_data.write("\n")

    col_num=0
    for cell in row:    
        # Separate cells in a row by tab delimiter
        if(col_num!=0):
            table_data.write("\t") 

        table_data.write(cell.encode(encoding='UTF-8',errors='strict'))
        col_num = col_num+1

    row_num = row_num+1   

This is the code that writes to Postgres database from my StringIO object table_data:

cursor = db_connection.cursor()
cursor.copy_from(table_data, <postgres_table_name>)
Community
  • 1
  • 1
user3422637
  • 3,967
  • 17
  • 49
  • 72
  • Show your code please – Craig Ringer Oct 28 '14 at 23:17
  • Which MySQL wrapper are you using? – abarnert Oct 28 '14 at 23:59
  • Also, instead of showing "some pieces of my code", create a self-contained [minimal, complete, verifiable example](http://stackoverflow.com/help/mcve) and post that here. – abarnert Oct 29 '14 at 00:02
  • I am using MySQLdb as the python package to retrieve data from MySQL – user3422637 Oct 29 '14 at 00:28
  • Meanwhile, after your updates and responses, there's still no MCVE, and other critical information is only visible in comments, which means anyone searching because they want to help people like you or because they have a similar problem won't see it. – abarnert Oct 29 '14 at 22:34
  • I will try to add some more code. I can't give a lot more code than this because of security reasons. – user3422637 Oct 29 '14 at 22:37
  • @user3422637: Read the linked help. Nobody wants your real code, they want a minimal, complete, verifiable example that demonstrates the problem. – abarnert Oct 29 '14 at 23:35

1 Answers1

7

The problem is that you're calling encode on a str object.

A str is a byte string, usually representing text encoded in some way like UTF-8. When you call encode on that, it first has to be decoded back to text, so the text can be re-encoded. By default, Python does that by calling s.decode(sys.getgetdefaultencoding()), and getdefaultencoding() usually returns 'ascii'.

So, you're talking UTF-8 encoded text, decoding it as if it were ASCII, then re-encoding it in UTF-8.

The general solution is to explicitly call decode with the right encoding, instead of letting Python use the default, and then encode the result.

But when the right encoding is already the one you want, the easier solution is to just skip the .decode('utf-8').encode('utf-8') and just use the UTF-8 str as the UTF-8 str that it already is.

Or, alternatively, if your MySQL wrapper has a feature to let you specify an encoding and get back unicode values for CHAR/VARCHAR/TEXT columns instead of str values (e.g., in MySQLdb, you pass use_unicode=True to the connect call, or charset='UTF-8' if your database is too old to auto-detect it), just do that. Then you'll have unicode objects, and you can call .encode('utf-8') on them.

In general, the best way to deal with Unicode problems is the last one—decode everything as early as possible, do all the processing in Unicode, and then encode as late as possible. But either way, you have to be consistent. Don't call str on something that might be a unicode; don't concatenate a str literal to a unicode or pass one to its replace method; etc. Any time you mix and match, Python is going to implicitly convert for you, using your default encoding, which is almost never what you want.

As a side note, this is one of the many things that Python 3.x's Unicode changes help with. First, str is now Unicode text, not encoded bytes. More importantly, if you have encoded bytes, e.g., in a bytes object, calling encode will give you an AttributeError instead of trying to silently decode so it can re-encode. And, similarly, trying to mix and match Unicode and bytes will give you an obvious TypeError, instead of an implicit conversion that succeeds in some cases and gives a cryptic message about an encode or decode you didn't ask for in others.

abarnert
  • 354,177
  • 51
  • 601
  • 671
  • I tried passing in charset='UTF-8' in my MYSQL DB connection with use_unicode=True. Also realized that some of the tables in source MySQL database are latin1_swedish_ci while some are utf_8. I get an error like this now: db_connection = MySQLdb.connect(host=host,user=user,passwd=passwd,db=db, charset="utf8", use_unicode=True) cell = str(cell).replace("\r", " ").replace("\n", " ").replace("\t", '').replace("\"", "") UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 47: ordinal not in range(128) – user3422637 Oct 29 '14 at 20:49
  • @user3422637: OK, if you want to use `unicode` objects instead of `str`—which is a good idea—then you can't go calling `str` on them, because that will immediately try to encode them to your default charset. And you shouldn't mix and match `unicode` and `str` objects like you're doing, because that also has to implicitly encode or decode one or the other; use `replace(u"\r", u" ")` and the like. (Actually, `"\r"` is bad in another way—if you have backslashes, either escape them, or use a raw string literal.) – abarnert Oct 29 '14 at 22:27
  • @user3422637: More generally, if you're trying to bash on things until they work without actually understanding it, you're just going to get more and more lost. If you haven't read the [Unicode HOWTO](https://docs.python.org/2/howto/unicode.html), read that. – abarnert Oct 29 '14 at 22:28
  • @user3422637: Finally, depending on your version of MySQL and MySQLdb, you may still have problems with those `latin1_swedish_ci` tables (with newer versions, the problems may go away if you leave off the `charset=` and let it auto-detect, but they may not). But don't worry about that unless/until you get there; first get your code working with the UTF-8 tables, then test on the Swedish ones. – abarnert Oct 29 '14 at 22:29
  • 1
    Thanks a lot for your continuous help. So, I figured out that ultimately, I need to write the data in str format because I am writing it to StringIO object (which then copies data to Postgres using copy_from). However, I had to deal with all those unicode characters like u'u\2018', u'\xc9' etc that couldn't be handled by the str() function. Hence I imported the unicode python package and used unicode() function to convert all these unicode characters to the closes ASCII characters. Then I did an str() on that. This solved my problems. Thanks. – user3422637 Oct 29 '14 at 23:00
  • @user3422637: There's a whole lot more confusion in that comment. I don't know what `unicode` package you imported, but you shouldn't need any such thing. I also don't know why you're trying to call `str` on anything. Before, you already had a `str`, so it was pointless. Now, you have a `unicode`, and the way to encode one of those to `str` is with the `encode` method. And you definitely don't have to throw away accent marks and curly quotes and emoji and so on just to copy your data from one database to another. Again, please read the HOWTO. – abarnert Oct 29 '14 at 23:38
  • @user3422637: One more thing: Notice that [`copy_from`](http://initd.org/psycopg/docs/cursor.html#cursor.copy_from) can take Unicode files: "data read from files implementing the io.TextIOBase interface are encoded in the connection encoding when sent to the backend". So, if you just use `io.StringIO` instead of `StringIO.StringIO`, you don't need to encode at all. MySQLdb gives you Unicode, you process it as Unicode, you feed it to psycopg as Unicode, all your programs go away. – abarnert Oct 29 '14 at 23:40