2

I'm having a weird problem regarding Unicode handling with SQLAlchemy. In short, when I insert a Python unicode string into an Unicode column of my MySQL database, I have no trouble getting it back out. On the database side, however, it gets stored as a weird 4-byte sequence (and no, this doesn't seem to have anything to do with the 'utf8mb4' default on MySQL)

My problem is that I have a MySQL dump from another machine that contains straight UTF8 characters in the SQL. When I try to retrieve data imported from that other machine I get UnicodeDecodeErrors all the time.

Below I've included a minimal example that illustrates the problem.

  • utf8test.sql: Set up a database and create one row with a Unicode character in it

  • utf8test.py: Open DB using SQLAlchemy, insert 1 row with Python's idea of an UTF character, and retrieve both rows.

It turns out that Python can retrieve the data it inserted itself fine, but it balks at the literal 'ä' I put into the SQL import script. Investigation of the hexdumps of both an mysqldumped dataset and the binary data files of MySQL itself shows that the UTF character inserted via SQL is the real deal (German umlaut 'ä' = UTF 'c3 bc'), whereas the Python-inserted 'ä' gets converted to the sequence 'c3 83 c2 a4' which I don't understand (see hexdump down below; I've used 'xxx' and 'yyy' as markers to faciliate finding them in the hexdump).

Can anybody shed any light on this?

This creates the test DB:

dh@jenna:~/python$ cat utf8test.sql
DROP DATABASE IF EXISTS utftest;
CREATE DATABASE utftest;
USE utftest;
CREATE TABLE x (
    id INTEGER PRIMARY KEY AUTO_INCREMENT,
        text VARCHAR(10)
        );
INSERT INTO x(text) VALUES ('xxxü');
COMMIT;
dh@jenna:~/python$ mysql < utf8test.sql

Here's the Pyhton script:

dh@jenna:~/python$ cat utf8test.py
# -*- encoding: utf8 -*-

from sqlalchemy import create_engine, Column, Unicode, Integer
from sqlalchemy.orm import sessionmaker
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()
class X(Base):
    __tablename__ = 'x'
    id = Column(Integer, primary_key=True)
    text = Column(Unicode(10))

engine = create_engine('mysql://localhost/utftest',
    encoding='utf8')
Base.metadata.create_all(engine)
Session = sessionmaker(engine)

db = Session()
x = X(text=u'yyyä')
db.add(x)
db.commit()

rs = db.query(X.text).all()
for r in rs:
    print(r.text)

db.close()

This happens when I run the script (runs without error when I omit the INSERT INTO bit in utf8test.sql):

dh@jenna:~/python$ python utf8test.py
Traceback (most recent call last):
  File "utf8test.py", line 23, in <module>
      rs = db.query(X.text).all()
[...]
UnicodeDecodeError: 'utf8' codec can't decode
    byte 0xfc in position 3: invalid start byte

Here's a hexdump to confirm that the two ä's are indeed stored differently in the DB. Using hd I've also conformed that both the Python as well as the SQL scripts are indeed UTF.

dh@jenna:~/python$ mysqldump utftest | hd
00000000  2d 2d 20 4d 79 53 51 4c  20 64 75 6d 70 20 31 30  |-- MySQL dump 10|
00000010  2e 31 36 20 20 44 69 73  74 72 69 62 20 31 30 2e  |.16  Distrib 10.|
00000020  31 2e 33 37 2d 4d 61 72  69 61 44 42 2c 20 66 6f  |1.37-MariaDB, fo|
00000030  72 20 64 65 62 69 61 6e  2d 6c 69 6e 75 78 2d 67  |r debian-linux-g|
00000040  6e 75 20 28 69 36 38 36  29 0a 2d 2d 0a 2d 2d 20  |nu (i686).--.-- |
[...]
00000520  4c 45 20 4b 45 59 53 20  2a 2f 3b 0a 49 4e 53 45  |LE KEYS */;.INSE|
00000530  52 54 20 49 4e 54 4f 20  60 78 60 20 56 41 4c 55  |RT INTO `x` VALU|
00000540  45 53 20 28 31 2c 27 78  78 78 c3 bc 27 29 2c 28  |ES (1,'xxx..'),(|
00000550  32 2c 27 79 79 79 c3 83  c2 a4 27 29 3b 0a 2f 2a  |2,'yyy....');./*|
musbur
  • 567
  • 4
  • 16
  • Don't know the whole story, but `c3 83 c2 a4` is UTF-8 bytes of "ä" decoded as latin-1 and re-encoded as UTF-8: `'ä'.encode('utf-8').decode('latin-1').encode('utf-8')` – Ilja Everilä Nov 26 '18 at 11:33

2 Answers2

0

c3 83 c2 a4 is the "double encoding" for ä. as Ilja points out. It is discussed further here

http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases provides an UPDATE to fix the data.

Here is a checklist of things that may need to be fixed in your Python: http://mysql.rjweb.org/doc.php/charcoll#python

But this is scary: I see c3 bc (Mojibake for ü) and c3 83 c2 a4 (double-encoding of ä. This implies that you have two different problems happening in the same code. Back up to ground zero, make sure you are using utf8 (or utf8mb4) at all stages of things. Your database may be too messed up to recover from, so consider starting over.

Possibly the only issue is the absence of # -*- encoding: utf8 -*- from one of the python scripts. But, no. You do need that, yet the double-encoding occurred when you used it.

Bottom line: You have multiple errors.

Rick James
  • 135,179
  • 13
  • 127
  • 222
  • The "scary" bit in fact isn't scary, if you look at my original post you see that I indeed typed 'xxxü' and 'yyyä' which, fortunately, is just confusing. Of course I had meant to use 'ä' both times. – musbur Nov 27 '18 at 13:27
  • @musbur - Is one of them from the normal processing, and the other was a manual test? If so, let's focus on the former. – Rick James Nov 27 '18 at 18:59
  • The 'xxxü' I typed directly into the SQL code, the 'yyyä' was inserted (and correctly retrieved) by Python / SQLAlchemy – musbur Nov 29 '18 at 04:48
  • confirmed that my data gets double-encoded on entering the DB (as evidenced by the appearance of double-encoded sequences in the internal DB data as well as SQL dumps), and it gets double-decoded when read back into Python by SQLAlchemy. From what I can see, everything is set correctly to utf8(mb4) on the database side. When I declare the "text" column as String in the SQLAlchemy model, the 'ä' goes into the DB as utf8 and also comes out correctly, but Python doesn't recognize it as UTF8 any more, leading to problems later. – musbur Nov 29 '18 at 05:20
  • @musbur - Since there are about 5 places where utf8 needs to be specified, it is all too easy to focus on the parts that were done correctly and fail to find the other places. – Rick James Nov 29 '18 at 21:43
-1

Adding ?use_utf8=0 to the DB URL solves the problem. Found that in the SQLAlchemy docs.

musbur
  • 567
  • 4
  • 16