Recently I had the same issue with field value being a byte string instead of unicode. Here's a little analysis.
Overview
In general all one needs to do to have unicode values from a cursor, is to pass charset
argument to connection constructor and have non-binary table fields (e.g. utf8_general_ci
). Passing use_unicode
is useless because it is set to true whenever charset
has a value.
MySQLdb respects cursor description field types, so if you have a DATETIME
column in cursor the values will be converted to Python datatime.datetime
instances, DECIMAL
to decimal.Decimal
and so on, but binary values will be represented as is, by byte strings. Most of decoders are defined in MySQLdb.converters
, and one can override them on instance basis by providing conv
argument to connection constructor.
But unicode decoders are an exception here, which is likely a design shortcoming. They are appended directly to connection instance converters in its constructor. So it's only possible to override them on instance-basic.
Workaround
Let's see the issue code.
import MySQLdb
connection = MySQLdb.connect(user = 'guest', db = 'test', charset = 'utf8')
cursor = connection.cursor()
cursor.execute(u"SELECT 'abcdё' `s`, ExtractValue('<a>abcdё</a>', '/a') `b`")
print cursor.fetchone()
# (u'abcd\u0451', 'abcd\xd1\x91')
print cursor.description
# (('s', 253, 6, 15, 15, 31, 0), ('b', 251, 6, 50331648, 50331648, 31, 1))
print cursor.description_flags
# (1, 0)
It shows that b
field is returned as a byte string instead of unicode. However it is not binary, MySQLdb.constants.FLAG.BINARY & cursor.description_flags[1]
(MySQLdb field flags). It seems like bug in the library (opened #90). But the reason for it I see as MySQLdb.constants.FIELD_TYPE.LONG_BLOB
(cursor.description[1][1] == 251
, MySQLdb field types) just hasn't a converter at all.
import MySQLdb
import MySQLdb.converters as conv
import MySQLdb.constants as const
connection = MySQLdb.connect(user = 'guest', db = 'test', charset = 'utf8')
connection.converter[const.FIELD_TYPE.LONG_BLOB] = connection.converter[const.FIELD_TYPE.BLOB]
cursor = connection.cursor()
cursor.execute(u"SELECT 'abcdё' `s`, ExtractValue('<a>abcdё</a>', '/a') `b`")
print cursor.fetchone()
# (u'abcd\u0451', u'abcd\u0451')
print cursor.description
# (('s', 253, 6, 15, 15, 31, 0), ('b', 251, 6, 50331648, 50331648, 31, 1))
print cursor.description_flags
# (1, 0)
Thus by manipulating connection instance converter
dict, it is possible to achieve desired unicode decoding behaviour.
If you want to override the behaviour here's how a dict entry for possible text field looks like after constructor.
import MySQLdb
import MySQLdb.constants as const
connection = MySQLdb.connect(user = 'guest', db = 'test', charset = 'utf8')
print connection.converter[const.FIELD_TYPE.BLOB]
# [(128, <type 'str'>), (None, <function string_decoder at 0x7fa472dda488>)]
MySQLdb.constants.FLAG.BINARY == 128
. This means that if a field has binary flag it will be str
, otherwise unicode decoder will be applied. So you want to try to convert binary values as well, you can pop the first tuple.