13

How do I encode something in ut8mb4 in Python?

I have two sets of data: data I am migrating to my new MySQL database over from Parse, and data going forward (that talks only to my new database). My database is utf8mb4 in order to store emoji and accented letters.

The first set of data only shows up correctly (when emoji and accents are involved) when I have in my python script:

MySQLdb.escape_string(unicode(xstr(data.get('message'))).encode('utf-8')) 

and when reading from the MySQL database in PHP:

$row["message"] = utf8_encode($row["message"]);

The second set of data only shows up correctly (when emoji and accents are involved) when I DON'T include the utf8_encode($row["message"]) portion. I am trying to reconcile these so that both sets of data are returned correctly to my iOS app. Please help!

Martijn Pieters
  • 1,048,767
  • 296
  • 4,058
  • 3,343
user3781236
  • 728
  • 2
  • 9
  • 23
  • 1
    I presume you're talking about MySQL? If so then Python's utf8 should be identical to MySQL's utf8mb4. Can you be more specific about what you're doing and the results you see? – Mark Ransom Oct 23 '14 at 16:26
  • It seems utf8mb4 is something MySQL invented because their ut8 implementation only works with BMP characters. This is not the case for normal utf8, which can encode any unicode character. That's the encoding that Python is using. Perhaps MySQL provides a set of functions for working with their custom encoding? – Cameron Oct 23 '14 at 16:29
  • @MarkRansom have added more to the question to be more specific – user3781236 Oct 23 '14 at 17:02
  • It is not clear as to a) what your data looks like in both cases and b) what the data looks like when it doesn't work. Please define clearly what problems you are seeing. – Martijn Pieters Oct 23 '14 at 18:10

4 Answers4

26

I have struggled myself with the correct exchange of the full range of UTF-8 characters between Python and MySQL for the sake of Emoji and other characters beyond the U+FFFF codepoint.

To be sure that everything worked fine, I had to do the following:

  1. make sure utf8mb4 was used for CHAR, VARCHAR, and TEXT columns in MySQL
  2. enforce UTF-8 in Python
  3. enforce UTF-8 to be used between Python and MySQL

To enforce UTF-8 in Python, add the following line as first or second line of your Python script:

# -*- coding: utf-8 -*-

To enforce UTF-8 between Python and MySQL, setup the MySQL connection as follows:

# Connect to mysql.
dbc = MySQLdb.connect(host='###', user='###', passwd='###', db='###', use_unicode=True)

# Create a cursor.
cursor = dbc.cursor()

# Enforce UTF-8 for the connection.
cursor.execute('SET NAMES utf8mb4')
cursor.execute("SET CHARACTER SET utf8mb4")
cursor.execute("SET character_set_connection=utf8mb4")

# Do database stuff.

# Commit data.
dbc.commit()

# Close cursor and connection.
cursor.close()
dbc.close()

This way, you don't need to use functions such as encode and utf8_encode.

Tomasz Nguyen
  • 2,561
  • 22
  • 25
  • 2
    Very useful checklist. Rescued me from a steaming . – Bob Stein Nov 03 '15 at 03:08
  • This fixes the issue I was experiencing with mysql + python + emojis – Paulo Malvar Mar 04 '16 at 19:24
  • The `coding` line applies *only to how Python decodes string literals in source code*. Not to how encoding and decoding of *data* is handled. Unless this is Python 2 and you used non-ASCII characters in string literals, there is no need to set this. In Python 3, UTF-8 is *the default source encoding*. – Martijn Pieters Sep 28 '16 at 21:43
  • Beware of this answer if you are on Python3! I've a MySQL database using utf8mb collation. I use MySQLdb with Python3 and I do not need to do anything special to properly read unicode characters. I tried your suggestion and it turned out to perform some sort of double encoding on data, returning some weird unicode characters instead of the right ones. – Augusto Destrero Oct 03 '18 at 20:06
  • @baxeico I am not sure if I ever tested my answer in Python 3 back in the day. I know that above was a big issue in Python 2. I'll do some testing in Python 3 and update my answer. – Tomasz Nguyen Oct 07 '18 at 12:06
  • Actually all I need to add is charset='utf8' argument to mysqldb.connect call. – Augusto Destrero Oct 07 '18 at 12:28
  • I already had the charset='utf8' argument in my connect call but it only worked for me when I added the three `cursor.execute` commands. Thanks! – Mo Beigi Oct 12 '19 at 13:57
25

MySQL's utf8mb4 encoding is just standard UTF-8.

They had to add that name however to distinguish it from the broken UTF-8 character set which only supported BMP characters.

In other words, from the Python side you should always encode to UTF-8 when talking to MySQL, but take into account that the database may not be able to handle Unicode codepoints beyond U+FFFF, unless you use utf8mb4 on the MySQL side.

However, generally speaking, you want to avoid manually encoding and decoding, and instead leave it to MySQLdb worry about this. You do this by configuring your connection and your collations to handle Unicode text transparently. For MySQLdb, that means setting charset='utf8mb4':

database = MySQLdb.connect(
    host=hostname,
    user=username,
    passwd=password,
    db=databasename,
    charset="utf8mb4"
)

Then use normal Python 3 str strings; leave the use_unicode option set to it's default True*.

Note: this handles SET NAMES and SET character_set_connection) for you, there is no need to issue those manually.


* Unless you still use Python 2, then the default is False. Set it to True and use u'...' unicode strings.

Martijn Pieters
  • 1,048,767
  • 296
  • 4,058
  • 3,343
2

use_unicode=True didn't work for me.

My solution

  • in mysql, change entire database, table and field encoding to utf8mb4
  • MySQLdb.connect(host='###' [...], charset='utf8'
  • dbCursor.execute('SET NAMES utf8mb4')
  • dbCursor.execute("SET CHARACTER SET utf8mb4")
Ti Hausmann
  • 926
  • 8
  • 21
2

You can also enter the type of code that you want in the following way

mysql.connector.connect(host = '<host>', database = '<db>', user = '<user>', password = '<password>', charset = 'utf8')

The fields inside '<>' are your own details. Instead of 'utf8' you can also write 'utf8mb4' depending on the type of coding your mysqldb wants.

Simba
  • 23,537
  • 7
  • 64
  • 76
Natty
  • 527
  • 5
  • 10