Python with MySql unicode problems

Question

I need to call MySQL stored procedure from my python script. As one of parameters I'm passing a unicode string (Russian language), but I get an error;

UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-1: ordinal not in range(256)

My script:

  self.db=MySQLdb.connect("localhost", "usr", "pass", "dbName")
  self.cursor=self.db.cursor()
  args=("какой-то текст") #this is string in russian
  self.cursor.callproc('pr_MyProc', args)
  self.cursor.execute('SELECT @_pr_MyProc_2') #getting result from sp
  result=self.cursor.fetchone()
  self.db.commit()

I've read that setting charset='utf8' shuld resolve this problem, but when I use string:

self.db=MySQLdb.connect("localhost", "usr", "pass", "dbName", charset='utf8')

This gives me another error;

UnicodeEncodeError: 'utf-8' codec can't encode character '\udcd1' in position 20: surrogates not allowed

Also I've trying to set parametr use_unicode=True, that's not working.

Does it work if you send the string in unicode: `args=(u"какой-то текст")`? — Daniel Roseman, Aug 23 '16 at 06:21
try http://stackoverflow.com/questions/27366479/python-3-os-walk-file-paths-unicodeencodeerror-utf-8-codec-cant-encode-s — safin chacko, Aug 30 '16 at 05:56
what's also strange is that that Cyrillic (?) string can be decoded into UTF-8 without issue. — Danielle M., Aug 30 '16 at 16:17

score 5 · Answer 1 · answered Aug 23 '16 at 17:47

5

More things to check on: http://mysql.rjweb.org/doc.php/charcoll#python

Likely items:

Start code file with # -*- coding: utf-8 -*- -- (for literals in code)
Literals should be u'...'

Can you extract the HEX? какой-то текст should be this in utf8: D0BA D0B0 D0BA D0BE D0B9 2D D182 D0BE D182 20 D0B5 D0BA D181 D182

answered Aug 23 '16 at 17:47

Rick James

135,179
13
127
222

2

As I understand "# -*- coding: utf-8 -*-" sets encoding for file, not for strings. So it does'not help. – Gleb Aug 25 '16 at 06:48
Does it hurt in your situation? – Rick James Aug 25 '16 at 16:19

regilero · Answer 2 · 2017-08-21T07:53:14.460

Here are some thoughts. Maybe not a response. I've been playing with python/mysql/utf-8/unicode in the past and this is the things i remember:

Looking at Saltstack mysql module's comment :

https://github.com/saltstack/salt/blob/develop/salt/modules/mysql.py#L314-L322

# MySQLdb states that this is required for charset usage
# but in fact it's more than it's internally activated
# when charset is used, activating use_unicode here would
# retrieve utf8 strings as unicode() objects in salt
# and we do not want that.
#_connarg('connection_use_unicode', 'use_unicode')
connargs['use_unicode'] = False
_connarg('connection_charset', 'charset')

We see that to avoid altering the result string the use_unicode is set to False, while the charset (which could be utf-8) is set as a parameter. use_unicode is more a 'request' to get responses as unicode strings.

You can check real usage in the tests, here: https://github.com/saltstack/salt/blob/develop/tests/integration/modules/test_mysql.py#L311-L361 with a database named '標準語'.

Now about the message UnicodeEncodeError: 'utf-8' codec can't encode character '\udcd1' **. You are using **unicode but you tell the module it is utf-8. It is not utf-8 until you encode your unicode string in utf-8.

Maybe you should try with:

args=(u"какой-то текст".encode('utf-8'))

At least in python3 this is required, because your "какой-то текст" is not in utf-8 by default.

Link is now broken: I found the new link: https://github.com/saltstack/salt/blob/develop/tests/integration/modules/test_mysql.py#L311-L361 (Note file_name change) — Brandon Deo, Aug 08 '17 at 19:31

score 2 · Answer 3 · edited May 23 '17 at 11:51

2

The MySQLdb module is not compatible with python 3. That might be why you are getting problems. I would advise to use a different connector, like PyMySQL or mysqlclient.

Related: 23376103.

edited May 23 '17 at 11:51

Community

1
1

answered Aug 25 '16 at 09:52

Lucas Moeskops

5,445
3
28
42

1

I'm using mysqlclinet – Gleb Aug 25 '16 at 10:53

score 2 · Answer 4 · answered Aug 30 '16 at 13:30

2

Maybe you can reload your sys in utf-8 and try to decode the string into utf-8 as following :

import sys
reload(sys)
sys.setdefaultencoding("utf-8")

...

stringUtf8 = u''.join(string_original).decode('utf-8')

answered Aug 30 '16 at 13:30

Floran Gmehlin

824
1
11
34

score 2 · Answer 5 · answered Aug 30 '16 at 14:19

I had a similar problem very recently but with PostgreSQL. After trying tons of suggestions from SO/ internet, I realized the issue was with my database. I had to drop my database and reinstall Postgres, because for some reason it was not allowing me to change the database's default collation. I was in a hurry so couldn't find a better solution, but would recommend the same, since I was only starting my application in the deployment environment. All the best.

score 1 · Answer 6 · answered Aug 30 '16 at 01:09

1

What's your database's charset?
use :

show variables like "characetr%";

or see your database's charset

answered Aug 30 '16 at 01:09

JonahCui

125
1
5

score 1 · Answer 7 · answered Aug 30 '16 at 14:10

I see here two problems.

You have unicode but you try to define it as utf-8 by setting parameter "charset". You should first encode your unicode to utf-8 or another encoding system.
If it however doesn't work, try to do so with init_command='SET NAMES UTF8' parameter.

So it will look like:

conn = MySQLdb.connect(charset='utf8', init_command='SET NAMES UTF8')

You can try also this:

cursor = db.cursor()

cursor.execute("SET NAMES UTF8;")

cJ Zougloub · Answer 8 · 2017-12-31T07:47:43.687

I encountered a similar issue, which was caused by invalid utf-8 data in the database; it seems that MySQL doesn't care about that, but Python does, because it's following the UTF-8 spec, which says that:

surrogate pairs are not allowed in utf-8
unpaired surrogates are not allowed in utf-8

If you want to "make it work", you'll have to intercept the MySQL packet and use your own converter which will perform ad-hoc replacements.

Here's one way to "handle" invalid data containing surrogates:

def borked_utf8_decode(data):
    """
    Work around input with unpaired surrogates or surrogate pairs,
    replacing by XML char refs: look for "&#\d+;" after.
    """
    return data.decode("utf-8", "surrogatepass") \
     .encode("utf-8", "xmlcharrefreplace") \
     .decode("utf-8")

Note that the proper way to handle that is context-dependent, but there are some common replacement scenarios, like this one.

And here's one way of plugging this into pymysql (another way is to monkey-patch field processing, see eg. https://github.com/PyMySQL/PyMySQL/issues/631):

import pymysql.converters

# use this in your connection
pymysql_use_unicode = False
conversions = pymysql.converters.conversions
conversions[pymysql.converters.FIELD_TYPE.STRING] = borked_utf8_decode
conversions[pymysql.converters.FIELD_TYPE.VAR_STRING] = borked_utf8_decode
conversions[pymysql.converters.FIELD_TYPE.VARCHAR] = borked_utf8_decode

Python with MySql unicode problems

8 Answers8