When you use
fh = codecs.open(fname,'r','utf8')
then fh.read() returns a unicode object. If you pass this unicode object to your database driver (such as mysql-python) to insert data into your database, the driver is responsible for encoding it into bytes. The driver uses the encoding set by
con.set_character_set('utf8')
If you use
fh = open(fname, 'r')
then fh.read() returns a string of bytes. You are at the mercy of whatever bytes happen to be in fname. Fortunately, according to your post, the file is encoded in UTF-8. Since the data is already a string of bytes, the driver does not perform any encoding and simply passes the bytes as-is to the database.
Either way, the same string of UTF-8 encoded bytes gets inserted into the database.
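Here is a minimal sketch of that equivalence. It does not touch a database: the helper both_paths and the temporary file are hypothetical stand-ins, and the explicit .encode('utf8') only imitates what a driver configured for utf8 is assumed to do with a unicode object.

```python
import codecs
import os
import tempfile

# Sample data: the UTF-8 bytes for "Thank you." wrapped in curly quotes.
raw = b'\xe2\x80\x9cThank you.\xe2\x80\x9d'

def both_paths(data):
    """Return the bytes produced by each way of reading the same file."""
    fd, fname = tempfile.mkstemp()
    try:
        with os.fdopen(fd, 'wb') as f:
            f.write(data)
        # Path 1: decode on read, then encode the way a utf8 driver would.
        with codecs.open(fname, 'r', 'utf8') as fh:
            via_unicode = fh.read().encode('utf8')
        # Path 2: read the raw bytes and pass them through untouched.
        with open(fname, 'rb') as fh:
            via_bytes = fh.read()
    finally:
        os.remove(fname)
    return via_unicode, via_bytes

via_unicode, via_bytes = both_paths(raw)
print(via_unicode == via_bytes)
```

The comparison only holds because the file really is UTF-8 and the encoding used to decode matches the one used to encode.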
Let's take a look at the source code defining codecs.open:
def open(filename, mode='rb', encoding=None, errors='strict', buffering=1):
    if encoding is not None:
        if 'U' in mode:
            # No automatic conversion of '\n' is done on reading and writing
            mode = mode.strip().replace('U', '')
            if mode[:1] not in set('rwa'):
                mode = 'r' + mode
        if 'b' not in mode:
            # Force opening of the file in binary mode
            mode = mode + 'b'
    file = __builtin__.open(filename, mode, buffering)
    if encoding is None:
        return file
    info = lookup(encoding)
    srw = StreamReaderWriter(file, info.streamreader, info.streamwriter, errors)
    # Add attributes to simplify introspection
    srw.encoding = encoding
    return srw
Notice in particular what happens if no encoding is set:
file = __builtin__.open(filename, mode, buffering)
if encoding is None:
    return file
So codecs.open is essentially the same as the builtin open when no encoding is set. The builtin open returns a file object whose read method returns a str object. It does no decoding at all.
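You can check this yourself with a short sketch like the one below. The temporary file is a hypothetical stand-in for your file, and the mode is 'rb' so the result is the same on Python 3, where a text-mode builtin open would decode the data.

```python
import codecs
import os
import tempfile

fd, fname = tempfile.mkstemp()
try:
    with os.fdopen(fd, 'wb') as f:
        f.write(b'\xe2\x80\x9cThank you.\xe2\x80\x9d')
    # No encoding argument: codecs.open returns the builtin file object.
    with codecs.open(fname, 'rb') as fh:
        wrapped = isinstance(fh, codecs.StreamReaderWriter)
        content = fh.read()
finally:
    os.remove(fname)

print(wrapped)        # is the file wrapped in a StreamReaderWriter?
print(repr(content))  # the undecoded bytes from the file
```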
In contrast, when you specify an encoding, codecs.open returns a StreamReaderWriter with srw.encoding set to encoding. Now when you call the StreamReaderWriter's read method, a unicode object is usually returned. First, though, the str object read from the file must be decoded using the specified encoding, and that decoding can fail.
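As a rough illustration of the wrapped case, the sketch below opens a hypothetical temporary file with an encoding and inspects what comes back. It assumes the file holds the same UTF-8 bytes as in your example.

```python
import codecs
import os
import tempfile

fd, fname = tempfile.mkstemp()
try:
    with os.fdopen(fd, 'wb') as f:
        f.write(b'\xe2\x80\x9cThank you.\xe2\x80\x9d')
    # With an encoding, the file is wrapped in a StreamReaderWriter.
    with codecs.open(fname, 'r', encoding='utf-8') as srw:
        is_srw = isinstance(srw, codecs.StreamReaderWriter)
        encoding = srw.encoding
        text = srw.read()  # decoded text, not raw bytes
finally:
    os.remove(fname)

print(is_srw)
print(encoding)
print(repr(text))
```

The 16 bytes in the file come back as 12 characters, because each curly quote takes three bytes in UTF-8.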
In your example, the str object is
In [19]: content
Out[19]: '\xe2\x80\x9cThank you.\xe2\x80\x9d'
and if you specify the encoding as 'ascii', then the StreamReaderWriter tries to decode content using the 'ascii' encoding:
In [20]: content.decode('ascii')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
That's not surprising, since the ascii encoding can only decode bytes in the range 0-127, and '\xe2', the first byte in content, has an ordinal value outside that range.
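If you want to see exactly where the decode fails, a sketch along these lines may help. It works on the bytes directly rather than on a file, and the variable names failed_at and bad_byte are just for illustration.

```python
content = b'\xe2\x80\x9cThank you.\xe2\x80\x9d'

try:
    content.decode('ascii')
    failed_at = None
except UnicodeDecodeError as exc:
    # exc.start is the index of the first byte the codec could not decode.
    failed_at = exc.start

# The offending byte, as a one-byte slice, and its ordinal value.
bad_byte = content[failed_at:failed_at + 1]
ordinal = ord(bad_byte)

# The same bytes decode cleanly with the encoding the file was written in.
decoded = content.decode('utf-8')

print(failed_at)
print(ordinal)
print(repr(decoded))
```

Here ordinal is 226 (0xe2), which is above the 127 limit of ascii.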
For concreteness: When you don't specify an encoding:
In [13]: with codecs.open(filename, 'r') as f:
   ....:     content = f.read()
In [14]: content
Out[14]: '\xe2\x80\x9cThank you.\xe2\x80\x9d'
content is a str.
When you specify a valid encoding:
In [22]: with codecs.open(filename, 'r', encoding = 'utf-8') as f:
   ....:     content = f.read()
In [23]: content
Out[23]: u'\u201cThank you.\u201d'
content is a unicode.
When you specify an encoding that cannot decode the file's bytes:
In [25]: with codecs.open(filename, 'r', 'ascii') as f:
   ....:     content = f.read()
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
You get a UnicodeDecodeError.