38

I am trying to push user account data from an Active Directory to our MySQL-Server. This works flawlessly but somehow the strings end up showing an encoded version of umlauts and other special characters.

The Active Directory returns a string using this sample format: M\xc3\xbcller

This actually is the UTF-8 encoding for Müller, but I want to write Müller to my database not M\xc3\xbcller.

I tried converting the string with this line, but it results in the same string in the database: tempEntry[1] = tempEntry[1].decode("utf-8")

If I run print "M\xc3\xbcller".decode("utf-8") in the python console the output is correct.

Is there any way to insert this string the right way? I need this specific format for a web developer who wants to have this exact format, I don't know why he is not able to convert the string using PHP directly.

Additional info: I am using MySQLdb; The table and column encoding is utf8_general_ci

Raptor
  • 736
  • 2
  • 8
  • 21

8 Answers8

56

As @marr75 suggests, make sure you set charset='utf8' on your connections. Setting use_unicode=True is not strictly necessary as it is implied by setting the charset.

Then make sure you are passing unicode objects to your db connection as it will encode it using the charset you passed to the cursor. If you are passing a utf8-encoded string, it will be doubly encoded when it reaches the database.

So, something like:

conn = MySQLdb.connect(host="localhost", user='root', password='', db='', charset='utf8')
data_from_ldap = 'M\xc3\xbcller'
name = data_from_ldap.decode('utf8')
cursor = conn.cursor()
cursor.execute(u"INSERT INTO mytable SET name = %s", (name,))

You may also try forcing the connection to use utf8 by passing the init_command param, though I'm unsure if this is required. 5 mins testing should help you decide.

conn = MySQLdb.connect(charset='utf8', init_command='SET NAMES UTF8')

Also, and this is barely worth mentioning as 4.1 is so old, make sure you are using MySQL >= 4.1

Rob Cowie
  • 22,259
  • 6
  • 62
  • 56
  • Is that true for all MySQLdb.execute statements - that you must decode('utf8') to unicode and submit in string like so? The results on my test are then unicode in the DB and not utf8 encoded, but I get latin1 crap if I try to submit unicode directly. – Marc Maxmeister Mar 25 '14 at 16:56
19

Assuming you are using MySQLdb you need to pass use_unicode=True and charset="utf8" when creating your connection.

UPDATE: If I run the following against a test table I get -

>>> db = MySQLdb.connect(host="localhost", user='root', passwd='passwd', db='sandbox', use_unicode=True, charset="utf8")
>>> c = db.cursor()
>>> c.execute("INSERT INTO last_names VALUES(%s)", (u'M\xfcller', ))
1L
>>> c.execute("SELECT * FROM last_names")
1L
>>> print c.fetchall()
(('M\xc3\xbcller',),)

This is "the right way", the characters are being stored and retrieved correctly, your friend writing the php script just isn't handling the encoding correctly when outputting.

As Rob points out, use_unicode and charset combined is being verbose about the connection, but I have a natural paranoia about even the most useful python libraries outside of the standard library so I try to be explicit to make bugs easy to find if the library changes.

marr75
  • 5,666
  • 1
  • 27
  • 41
  • Yes I am using MySQLdb. Forgot to mention that in my previous post. I tried setting charset and use_unicode manually (though the former seems to imply the latter), but the result is the same. I also tried setting .decode("utf-8") in the INSERT-Statement. Still has the same format... – Raptor Jun 01 '11 at 14:55
  • What's the encoding set to on the column in your database? Try utf8-bin. You might be transmitting the data perfectly correctly but it is written in some encoding that does not include the characters you are using. – marr75 Jun 01 '11 at 15:00
  • The encoding was `utf8_general_ci`, i tried setting the table and the columns to `utf8_bin` with no effect. – Raptor Jun 01 '11 at 15:08
  • Updated to reflect my results after testing. – marr75 Jun 01 '11 at 16:17
  • I think this is the right direction, I tried to manually set the string and this seems to work. Something seems to be wrong with my conversion in python i think. I will continue debugging this during the next days and post a solution if possible. Btw: Thank you for your testing! – Raptor Jun 01 '11 at 16:26
  • @Raptor: The key point here is to pass the _unicode_ object to c.execute, _not_ the utf8 encoded string. MySQLdb will encode it for you on the way in (using charset param) and decode it for you on the way out (if use_unicode is True). – Rob Cowie Jun 01 '11 at 16:51
  • @Rob Cowie I think you're onto something as far as the problem raptor is having, u'M\xfcller' is the output of "M\xc3\xbcller".decode("utf8"), I assumed he was decoding the string anyway since that was in his answer, I'll update my sample code to reflect this step. – marr75 Jun 01 '11 at 22:29
  • I printed the entire array I put into the MySQL-DB and my example lists as follows: `u'M\\xc3\\xbcller'`. Shouldn't this have only one leading \ before each character? I tried getting rid of this \\ trying .replace('\\\\','\\') but it does not seem to work. Could this be my Problem? The MySQL-Database has the following in the fields after an insert: `M\xc3\xbcller` – Raptor Jun 03 '11 at 11:45
11
import MySQLdb

# connect to the database
db = MySQLdb.connect("****", "****", "****", "****") #don't use charset here

# setup a cursor object using cursor() method
cursor = db.cursor()

cursor.execute("SET NAMES utf8mb4;") #or utf8 or any other charset you want to handle

cursor.execute("SET CHARACTER SET utf8mb4;") #same as above

cursor.execute("SET character_set_connection=utf8mb4;") #same as above

# run a SQL question
cursor.execute("****")

#and make sure the MySQL settings are correct, data too
Peter O.
  • 32,158
  • 14
  • 82
  • 96
YEH
  • 374
  • 4
  • 18
10

I found the solution to my problems. Decoding the String with .decode('unicode_escape').encode('iso8859-1').decode('utf8') did work at last. Now everything is inserted as it should. The full other solution can be found here: Working with unicode encoded Strings from Active Directory via python-ldap

Community
  • 1
  • 1
Raptor
  • 736
  • 2
  • 8
  • 21
5

Recently I had the same issue with field value being a byte string instead of unicode. Here's a little analysis.

Overview

In general all one needs to do to have unicode values from a cursor, is to pass charset argument to connection constructor and have non-binary table fields (e.g. utf8_general_ci). Passing use_unicode is useless because it is set to true whenever charset has a value.

MySQLdb respects cursor description field types, so if you have a DATETIME column in cursor the values will be converted to Python datatime.datetime instances, DECIMAL to decimal.Decimal and so on, but binary values will be represented as is, by byte strings. Most of decoders are defined in MySQLdb.converters, and one can override them on instance basis by providing conv argument to connection constructor.

But unicode decoders are an exception here, which is likely a design shortcoming. They are appended directly to connection instance converters in its constructor. So it's only possible to override them on instance-basic.

Workaround

Let's see the issue code.

import MySQLdb

connection = MySQLdb.connect(user = 'guest', db = 'test', charset = 'utf8')
cursor     = connection.cursor()

cursor.execute(u"SELECT 'abcdё' `s`, ExtractValue('<a>abcdё</a>', '/a') `b`")

print cursor.fetchone() 
# (u'abcd\u0451', 'abcd\xd1\x91')
print cursor.description 
# (('s', 253, 6, 15, 15, 31, 0), ('b', 251, 6, 50331648, 50331648, 31, 1))
print cursor.description_flags 
# (1, 0)

It shows that b field is returned as a byte string instead of unicode. However it is not binary, MySQLdb.constants.FLAG.BINARY & cursor.description_flags[1] (MySQLdb field flags). It seems like bug in the library (opened #90). But the reason for it I see as MySQLdb.constants.FIELD_TYPE.LONG_BLOB (cursor.description[1][1] == 251, MySQLdb field types) just hasn't a converter at all.

import MySQLdb
import MySQLdb.converters as conv
import MySQLdb.constants as const

connection = MySQLdb.connect(user = 'guest', db = 'test', charset = 'utf8')
connection.converter[const.FIELD_TYPE.LONG_BLOB] = connection.converter[const.FIELD_TYPE.BLOB]
cursor = connection.cursor()

cursor.execute(u"SELECT 'abcdё' `s`, ExtractValue('<a>abcdё</a>', '/a') `b`")

print cursor.fetchone()
# (u'abcd\u0451', u'abcd\u0451')
print cursor.description
# (('s', 253, 6, 15, 15, 31, 0), ('b', 251, 6, 50331648, 50331648, 31, 1))
print cursor.description_flags
# (1, 0)

Thus by manipulating connection instance converter dict, it is possible to achieve desired unicode decoding behaviour.

If you want to override the behaviour here's how a dict entry for possible text field looks like after constructor.

import MySQLdb
import MySQLdb.constants as const

connection = MySQLdb.connect(user = 'guest', db = 'test', charset = 'utf8')
print connection.converter[const.FIELD_TYPE.BLOB]
# [(128, <type 'str'>), (None, <function string_decoder at 0x7fa472dda488>)]

MySQLdb.constants.FLAG.BINARY == 128. This means that if a field has binary flag it will be str, otherwise unicode decoder will be applied. So you want to try to convert binary values as well, you can pop the first tuple.

saaj
  • 23,253
  • 3
  • 104
  • 105
2

(Would like to reply to above answer but do not have enough reputation...)

The reason why you don't get unicode results in this case:

>>> print c.fetchall()
(('M\xc3\xbcller',),)

is a bug from MySQLdb 1.2.x with *_bin collation, see:

http://sourceforge.net/tracker/index.php?func=detail&aid=1693363&group_id=22307&atid=374932 http://sourceforge.net/tracker/index.php?func=detail&aid=2663436&group_id=22307&atid=374932

In this particular case (collation utf8_bin - or [anything]_bin...) you have to expect the "raw" value, here utf-8 (yes, this sucks as there is no generic fix).

lacorbeille
  • 325
  • 1
  • 4
  • 8
1

there is another situation maybe a little rare.

if you create a schema in mysqlworkbench firstly,you will get the encoding error and can't solve it by add charset configuration.

it is because mysqlworkbench create schema by latin1 by default, so you should set the charset at first! enter image description here

dogewang
  • 648
  • 1
  • 7
  • 15
0

and db.set_character_set('utf8'), imply that use_unicode=True ?

Sérgio
  • 6,966
  • 1
  • 48
  • 53