7

With sqlalchemy 0.9.7, I am trying to store emoji into MySQL 5.5 with utf8mb4 enabled. However, for some reason sqlalchemy is killing my emoji chars and I cannot figure out why. I can see the emoji char in the object before attempting to save the data via sqlalchemy. After the save, an error is thrown and emoji is rendered as ????.

The error is the following. Note it is surrounded by debug messages that output type(post.message) and post.message.

--------------------------------------------------------------------------------
DEBUG in __init__ [/mnt/hgfs/crw/dev/hyper/hyper/blueprint/chat/__init__.py:274]:
<type 'unicode'>
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
DEBUG in __init__ [/mnt/hgfs/crw/dev/hyper/hyper/blueprint/chat/__init__.py:275]:

--------------------------------------------------------------------------------
/mnt/hgfs/crw/dev/hyper/env/local/lib/python2.7/site-packages/sqlalchemy/engine/default.py:436: Warning: Incorrect string value: '\xF0\x9F\x98\x83' for column 'message' at row 1
  cursor.execute(statement, parameters)
--------------------------------------------------------------------------------
DEBUG in __init__ [/mnt/hgfs/crw/dev/hyper/hyper/blueprint/chat/__init__.py:277]:
<type 'unicode'>
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
DEBUG in __init__ [/mnt/hgfs/crw/dev/hyper/hyper/blueprint/chat/__init__.py:278]:
????
--------------------------------------------------------------------------------

My code looks like:

Base = declarative_base()

post = Table('post', Base.metadata,
        Column("id", Integer, primary_key=True),
        Column("message", UnicodeText),
        Column("created_at", DateTime),
        Column("updated_at", DateTime),
    )

mapper(Post, post)

Is there any other setup I need to do to make this work?

Update: If I do the following:

Base = declarative_base()

post = Table('post', Base.metadata,
        Column("id", Integer, primary_key=True),
        Column("message", UnicodeText),
        Column("created_at", DateTime),
        Column("updated_at", DateTime),
        mysql_engine='InnoDB',
        mysql_charset='utf8mb4'
    )

mapper(Post, post)

I get the following:

--------------------------------------------------------------------------------
DEBUG in __init__ [/mnt/hgfs/crw/dev/hyper/hyper/blueprint/chat/__init__.py:274]:
<type 'unicode'>
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
DEBUG in __init__ [/mnt/hgfs/crw/dev/hyper/hyper/blueprint/chat/__init__.py:275]:

--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
DEBUG in __init__ [/mnt/hgfs/crw/dev/hyper/hyper/blueprint/chat/__init__.py:277]:
<type 'unicode'>
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
DEBUG in __init__ [/mnt/hgfs/crw/dev/hyper/hyper/blueprint/chat/__init__.py:278]:
????
--------------------------------------------------------------------------------

No error, but it is still obliterating the emoji. What is going on?!

Update 2:

Changing the database URI being called from:

mysql+mysqldb://user:pass@localhost/datab?charset=utf8

To

mysql+mysqldb://user:pass@localhost/datab

Fixed the problem. sqlalchemy threw an error when I tried to use charset=utf8mb4&use_unicode=0. But is that wise? According to http://docs.sqlalchemy.org/en/rel_0_9/dialects/mysql.html#module-sqlalchemy.dialects.mysql.mysqldb, it is not! Would love any feedback on this solution.

Craig Wright
  • 3,220
  • 1
  • 21
  • 16
  • 1
    One emoji became exactly 4 question marks? That smells like latin1 at some stage, probably the ?charset=utf8 that you removed. Anyway, that probably should be utf8mb4. – Rick James May 02 '15 at 22:16
  • As @RickJames pointed out, an interesting observation is that if you see 1 question mark, that means it was stored correctly as 4-byte UTF-8 string but the current connection is using `unicode` charset. Switch over to `utf8mb4` would make it look fine. – Devy Sep 10 '15 at 19:38
  • f"mysql://{USERNAME}:{PW}@{HOST}:{PORT}/{DB_NAME}?charset=utf8mb4" – ATH Mar 01 '21 at 20:44

2 Answers2

11

Below's how I got emojis etc. working. I'm using Python 3.5 / Flask / Flask-SQLAlchemy.

Note: This fix assumes you're early enough in development that you're OK with recreating your database by running db.drop_all() and db.create_all().

  1. As recommended in this guide, open up a database console and run ALTER DATABASE database_name CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci;

  2. As recommended in this answer, add ?charset=utf8mb4 to the end of your SQLALCHEMY_DATABASE_URI string.

    • Before: mysql+mysqlconnector://{username}:{password}@{hostname}/{databasename}
    • After: mysql+mysqlconnector://{username}:{password}@{hostname}/{databasename}?charset=utf8mb4
  3. Now just rerun db.drop_all() and db.create_all().

Nathan Wailes
  • 9,872
  • 7
  • 57
  • 95
5

Here is the solution that worked for me. set_unicode gets called the first time a connection enters the connection pool and only gets called once. SET NAMES basically fixes the entire pipeline from your code to the database storage to ensure the correct UTF-8 charset (utf8mb4) is used. As long as utf8mb4 is ensured end to end, emojis should be stored / rendered fine as they are just regular unicode characters outside of BMP (Basic Multilingual Plane), which unfortunately MySQL decided that it was a good idea to only implement 3-byte Unicode implementation for BMP as the unicode charset.

See SQLAlchemy events reference for other relevant event hooks that you can tap into to do connection by connection configuration tweaks.

import logging
from sqlalchemy import event

logger = logging.getLogger(__name__)     

@event.listens_for(Pool, "connect")
def set_unicode(dbapi_conn, conn_record):
    cursor = dbapi_conn.cursor()
    try:
        cursor.execute("SET NAMES 'utf8mb4' COLLATE 'utf8mb4_unicode_ci'")
    except Exception as e:
        logger.debug(e)

Update: regarding extra options in connection string, I prefer to be explicit and specify charset=utf8mb4 because my environment's default charset is utf8(which is the culprit of why emoji cannot be encoded correctly), but never specify use_unicode=0 unless you are running on python 2.x and performance is a bottleneck as you pointed out in the link.

Devy
  • 9,655
  • 8
  • 61
  • 59