
I have a small Python program that reads in SQL statements from a file and runs them on a MySQL database. The file is encoded in UTF-8 and the database also uses UTF-8.

If I don't set the database encoding I get the usual error that everyone asks about "'latin-1' codec can't encode character...". So I set the database and file encoding using

con.set_character_set('utf8')
fh = codecs.open(fname,'r','utf8')

Now it works, but it also works when I don't set the file encoding (or just use the builtin open) and set the encoding only on the database. By "works" I mean that the resulting database records display properly in WordPress, which assumes UTF-8.

If I wanted magic, I'd code in Ruby. What is Python doing in this case and why was it not necessary to tell it the file encoding?

Needless to say I've done a lot of searching on this, and my Google-fu is usually pretty good. There are tons of posts here and in blogs on why it is necessary to set the encoding and how to do it, but I haven't found any on why it sometimes just works.

Edit: I ran a simple test on this using a file containing “Thank you.”

file
  E2 80 9C 54 68 61 6E 6B 20 79 6F 75 2E E2 80 9D
codecs utf8
  201C 54 68 61 6E 6B 20 79 6F 75 2E 201D

Attempting to read it with codecs.open(myfile,'r','ascii') returned "UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2"

The read from file produced a byte string, so it appears that the magic is occurring on the insert into the database.

Peter Wooster
  • Are you doing anything with the content of the file besides feeding it to MySQL? Python can read in UTF8 just fine with regular old open. In my experience it's when you try to write it back out that you usually get the 'latin-1 codec can't encode' error. – Anov Jan 16 '13 at 22:35
  • I'm giving the resulting database to WordPress which assumes it's UTF8. When it works properly the text displays properly, when it doesn't the text shows a lot of strange characters. It's "reading it just fine with regular old open" that has me confused as I thought that the default encoding was ISO 8859-1. – Peter Wooster Jan 16 '13 at 22:49
  • @anov, thanks, I've added the definition of "works" to the question. – Peter Wooster Jan 16 '13 at 22:53

2 Answers


When you use

fh = codecs.open(fname,'r','utf8')

fh.read() returns a unicode object. If you pass this unicode object to your database driver (such as mysql-python) to insert data into your database, then the driver is responsible for converting it into bytes. The driver uses the encoding set by

con.set_character_set('utf8')

If you use

fh = open(fname, 'r')

then fh.read() returns a string of bytes. You are at the mercy of whatever bytes happened to be in fname. Fortunately, according to your post, the file is encoded in UTF-8. Since the data is already a string of bytes, the driver does not perform any encoding, and simply communicates the string of bytes as is to the database.

Either way, the same string of UTF-8 encoded bytes gets inserted into the database.
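To make the two paths concrete, here is a minimal sketch written for Python 3, where the byte string is a bytes object and the unicode object is a str. The to_wire function is a hypothetical stand-in for the conversion a driver performs; it is not mysql-python's actual code.

```python
def to_wire(value, charset):
    """Hypothetical helper: return the bytes a driver would send to the server."""
    if isinstance(value, bytes):
        # Already bytes: passed through untouched, whatever encoding they are in.
        return value
    # Text: encoded using the connection's character set.
    return value.encode(charset)


raw = b'\xe2\x80\x9cThank you.\xe2\x80\x9d'  # what a plain binary read returns
text = raw.decode('utf-8')                    # what a UTF-8 decoding read returns

# Both paths produce identical bytes on a UTF-8 connection.
same = to_wire(raw, 'utf-8') == to_wire(text, 'utf-8')

# On a latin-1 connection only the text path fails, because the curly
# quotes have no latin-1 representation; the bytes path is never encoded.
try:
    to_wire(text, 'latin-1')
    latin1_error = None
except UnicodeEncodeError as exc:
    latin1_error = exc
```

Under these assumptions the bytes path works regardless of the connection's character set, which is consistent with the file encoding not mattering as long as the file already holds UTF-8.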


Let's take a look at the source code defining codecs.open:

def open(filename, mode='rb', encoding=None, errors='strict', buffering=1):

    if encoding is not None:
        if 'U' in mode:
            # No automatic conversion of '\n' is done on reading and writing
            mode = mode.strip().replace('U', '')
            if mode[:1] not in set('rwa'):
                mode = 'r' + mode
        if 'b' not in mode:
            # Force opening of the file in binary mode
            mode = mode + 'b'
    file = __builtin__.open(filename, mode, buffering)
    if encoding is None:
        return file
    info = lookup(encoding)
    srw = StreamReaderWriter(file, info.streamreader, info.streamwriter, errors)
    # Add attributes to simplify introspection
    srw.encoding = encoding
    return srw

Notice in particular what happens if no encoding is set:

file = __builtin__.open(filename, mode, buffering)
if encoding is None:
    return file

So codecs.open is essentially the same as the builtin open when no encoding is set. The builtin open returns a file object whose read method returns a str object. It does no decoding at all.

In contrast, when you specify an encoding, codecs.open returns a StreamReaderWriter with srw.encoding set to encoding. Now when you call the StreamReaderWriter's read method, a unicode object is returned, provided the str object read from the file can first be decoded using the specified encoding.

In your example, the str object is

In [19]: content
Out[19]: '\xe2\x80\x9cThank you.\xe2\x80\x9d'

and if you specify the encoding as 'ascii', then the StreamReaderWriter tries to decode content using the 'ascii' encoding:

In [20]: content.decode('ascii')

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)

That's not surprising since the ascii encoding can only decode bytes in the range 0--127, and '\xe2', the first byte in content, has ordinal value outside that range.
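The exception itself records where decoding stopped. The following is a small Python 3 sketch (bytes literals stand in for the Python 2 str shown above) that catches the error and inspects it; the variable names are illustrative only.

```python
content = b'\xe2\x80\x9cThank you.\xe2\x80\x9d'

try:
    content.decode('ascii')
    failure = None
except UnicodeDecodeError as exc:
    failure = exc

# The exception carries the codec name and the offset of the offending byte.
bad_byte = content[failure.start]   # integer value of the byte at that offset
in_ascii_range = bad_byte < 128     # False: 0xe2 is 226

# The same bytes decode cleanly once the matching codec is named.
decoded = content.decode('utf-8')
```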


For concreteness: When you don't specify an encoding:

In [13]: with codecs.open(filename, 'r') as f:
   ....:     content = f.read() 

In [14]: content
Out[14]: '\xe2\x80\x9cThank you.\xe2\x80\x9d'

content is a str.

When you specify a valid encoding:

In [22]: with codecs.open(filename, 'r', encoding = 'utf-8') as f:
   ....:     content = f.read()


In [23]: content
Out[23]: u'\u201cThank you.\u201d'

content is a unicode.

When you specify an encoding that cannot decode the file's bytes:

In [25]: with codecs.open(filename, 'r', 'ascii') as f:
   ....:     content = f.read()
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)

You get a UnicodeDecodeError.
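The session above is Python 2. For readers on Python 3, a rough equivalent of the same three cases can be sketched with the builtin open, assuming binary mode as the counterpart of "no encoding". The file name is hypothetical and the file is created in a temporary directory so the snippet is self-contained.

```python
import os
import tempfile

data = b'\xe2\x80\x9cThank you.\xe2\x80\x9d'

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'thanks.txt')  # hypothetical file name
    with open(path, 'wb') as f:
        f.write(data)

    # Case 1: no decoding at all; read() returns the raw bytes.
    with open(path, 'rb') as f:
        as_bytes = f.read()

    # Case 2: a matching encoding; read() returns decoded text.
    with open(path, 'r', encoding='utf-8') as f:
        as_text = f.read()

    # Case 3: an encoding that cannot represent the bytes; read() raises.
    try:
        with open(path, 'r', encoding='ascii') as f:
            f.read()
        ascii_failed = False
    except UnicodeDecodeError:
        ascii_failed = True
```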

unutbu

In this tutorial on Unicode in Python, the 4th paragraph describes the codecs.open(filename, mode, [encoding]) function you're using as follows:

encoding is a string giving the encoding to use; if it’s left as None, a regular Python file object that accepts 8-bit strings is returned.

Additionally, in the reference on the File object, it is said that

(file.encoding) may also be None, in which case the file uses the system default encoding for converting Unicode strings.

When codecs.open() is called with no encoding parameter, a File object is returned with an encoding attribute of None (tested), thus using the system default for Unicode, which must have been UTF-8 in your case. This explains why it's working so neatly when you're not being explicit.

matehat
  • how do I determine the system default encoding? is using open equivalent to using codecs.open without specifying encoding? – Peter Wooster Jan 19 '13 at 22:46
  • how does this work when I use the builtin open()? I've edited the question to add this. – Peter Wooster Jan 20 '13 at 02:26
  • Yes, using `open()` returns a `File` object with an encoding attribute of `None`, which is the same as `codecs.open` without an encoding parameter. You can find out your system default encoding by doing `sys.getdefaultencoding()`. To change it, see http://stackoverflow.com/questions/2276200/changing-default-encoding-of-python – matehat Jan 21 '13 at 17:36
  • Thanks, I'm on a Mac and sys.getdefaultencoding() return ascii. So it's not obvious why it works. – Peter Wooster Jan 21 '13 at 18:04
  • I see. Can you provide a little more from your code, such as the lines that sends the content of `fh` to the database? – matehat Jan 21 '13 at 20:51
  • You can find all the code at https://github.com/PeterWooster/SQL-Tools/blob/master/SQLRunner.py – Peter Wooster Jan 21 '13 at 21:41
  • Use `sys.getfilesystemencoding()`. `sys.getdefaultencoding()` is the default encoding when converting from Unicode to byte strings, which is normally `ascii` on Python 2.X and `utf-8` on Python 3.X. – Mark Tolonen Jan 22 '13 at 03:58
  • @marktolonen Thanks, the file system encoding is utf8, that explains the problem. If you could post an answer that elaborates on this comment, including how it's set etc. the bounty is yours. – Peter Wooster Jan 24 '13 at 15:54
  • I'm not sure it does, the documentation state that it "returns the name of the encoding used to convert Unicode filenames into system file names, or None if the system default encoding is used". So it's used for file names. It wouldn't theoretically be used for decoding Unicode streams. – matehat Jan 24 '13 at 18:13
  • I've updated my question to add the result of a simple test. Neither the default encoding nor the filesystem encoding appears to be used. It just reads a string of bytes if no encoding is specified. – Peter Wooster Jan 24 '13 at 21:30
  • Ack, actually I misunderstood. As @matehat says, `getfilesystemencoding()` is used for filenames. When you read any data, be it a file, socket, etc., you have to know what encoding it is in to convert it to Unicode text. – Mark Tolonen Jan 25 '13 at 02:23
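
As a small illustration of the point the comments converge on, the sketch below (Python 3; variable names are illustrative) queries both settings and shows that turning raw bytes into text is a separate, explicit step in which the encoding has to be named by the caller.

```python
import sys

# Used for implicit text/bytes conversions; always 'utf-8' on Python 3
# (it was normally 'ascii' on Python 2, as noted in the comments).
default_enc = sys.getdefaultencoding()

# Used to convert file *names*, not file contents.
fs_enc = sys.getfilesystemencoding()

# Raw data as it would arrive from a binary read of a file or socket.
raw = b'\xe2\x80\x9cThank you.\xe2\x80\x9d'

# Decoding is explicit: the caller states which encoding the data is in.
text = raw.decode('utf-8')
```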