I'm having a weird problem regarding Unicode handling with SQLAlchemy. In short, when I insert a Python unicode string into an Unicode column of my MySQL database, I have no trouble getting it back out. On the database side, however, it gets stored as a weird 4-byte sequence (and no, this doesn't seem to have anything to do with the 'utf8mb4' default on MySQL)
My problem is that I have a MySQL dump from another machine that contains straight UTF8 characters in the SQL. When I try to retrieve data imported from that other machine I get UnicodeDecodeErrors all the time.
Below I've included a minimal example that illustrates the problem.
utf8test.sql: Set up a database and create one row with a Unicode character in it
utf8test.py: Open DB using SQLAlchemy, insert 1 row with Python's idea of an UTF character, and retrieve both rows.
It turns out that Python can retrieve the data it inserted itself fine, but it balks at the literal 'ä' I put into the SQL import script. Investigation of the hexdumps of both an mysqldumped dataset and the binary data files of MySQL itself shows that the UTF character inserted via SQL is the real deal (German umlaut 'ä' = UTF 'c3 bc'), whereas the Python-inserted 'ä' gets converted to the sequence 'c3 83 c2 a4' which I don't understand (see hexdump down below; I've used 'xxx' and 'yyy' as markers to faciliate finding them in the hexdump).
Can anybody shed any light on this?
This creates the test DB:
dh@jenna:~/python$ cat utf8test.sql
DROP DATABASE IF EXISTS utftest;
CREATE DATABASE utftest;
USE utftest;
CREATE TABLE x (
id INTEGER PRIMARY KEY AUTO_INCREMENT,
text VARCHAR(10)
);
INSERT INTO x(text) VALUES ('xxxü');
COMMIT;
dh@jenna:~/python$ mysql < utf8test.sql
Here's the Pyhton script:
dh@jenna:~/python$ cat utf8test.py
# -*- encoding: utf8 -*-
from sqlalchemy import create_engine, Column, Unicode, Integer
from sqlalchemy.orm import sessionmaker
from sqlalchemy.ext.declarative import declarative_base
Base = declarative_base()
class X(Base):
__tablename__ = 'x'
id = Column(Integer, primary_key=True)
text = Column(Unicode(10))
engine = create_engine('mysql://localhost/utftest',
encoding='utf8')
Base.metadata.create_all(engine)
Session = sessionmaker(engine)
db = Session()
x = X(text=u'yyyä')
db.add(x)
db.commit()
rs = db.query(X.text).all()
for r in rs:
print(r.text)
db.close()
This happens when I run the script (runs without error when I omit the INSERT INTO bit in utf8test.sql):
dh@jenna:~/python$ python utf8test.py
Traceback (most recent call last):
File "utf8test.py", line 23, in <module>
rs = db.query(X.text).all()
[...]
UnicodeDecodeError: 'utf8' codec can't decode
byte 0xfc in position 3: invalid start byte
Here's a hexdump to confirm that the two ä's are indeed stored differently in the DB. Using hd I've also conformed that both the Python as well as the SQL scripts are indeed UTF.
dh@jenna:~/python$ mysqldump utftest | hd
00000000 2d 2d 20 4d 79 53 51 4c 20 64 75 6d 70 20 31 30 |-- MySQL dump 10|
00000010 2e 31 36 20 20 44 69 73 74 72 69 62 20 31 30 2e |.16 Distrib 10.|
00000020 31 2e 33 37 2d 4d 61 72 69 61 44 42 2c 20 66 6f |1.37-MariaDB, fo|
00000030 72 20 64 65 62 69 61 6e 2d 6c 69 6e 75 78 2d 67 |r debian-linux-g|
00000040 6e 75 20 28 69 36 38 36 29 0a 2d 2d 0a 2d 2d 20 |nu (i686).--.-- |
[...]
00000520 4c 45 20 4b 45 59 53 20 2a 2f 3b 0a 49 4e 53 45 |LE KEYS */;.INSE|
00000530 52 54 20 49 4e 54 4f 20 60 78 60 20 56 41 4c 55 |RT INTO `x` VALU|
00000540 45 53 20 28 31 2c 27 78 78 78 c3 bc 27 29 2c 28 |ES (1,'xxx..'),(|
00000550 32 2c 27 79 79 79 c3 83 c2 a4 27 29 3b 0a 2f 2a |2,'yyy....');./*|