
I want to use an API from a game and store the player and clan names in a local database. The names can contain all sorts of characters and emoticons. Here are just a few examples I found:

  • яαℓαηι
  • نکل
  • 窝猫
  • 鐵擊道遊隊
  • ❤✖❤♠️♦️♣️✖

I use Python to read the API and write the results into a MySQL database. After that, I want to use the names in a Node.js web application.

What is the best way to encode those characters, and how can I safely store them in the database so that I can display them correctly afterwards?

I tried to encode the strings in Python with UTF-8:

>>> sample = '蛙喜鄉民CLUB'
>>> sample
'蛙喜鄉民CLUB'
>>> sample = sample.encode('UTF-8')
>>> sample
b'\xe8\x9b\x99\xe5\x96\x9c\xe9\x84\x89\xe6\xb0\x91CLUB'

and stored the encoded string in a MySQL database that uses the utf8mb4_unicode_ci collation.
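For context, a table along these lines is assumed; the table and column names here are placeholders, not taken from the question. The important part is CHARACTER SET utf8mb4, because MySQL's legacy 3-byte "utf8" charset cannot store characters outside the Basic Multilingual Plane, such as emoji:

```python
# Hypothetical schema sketch -- `players` and `name` are assumed names.
# utf8mb4 is MySQL's full 4-byte UTF-8 and is required for emoji;
# the legacy "utf8" charset silently rejects 4-byte sequences.
CREATE_PLAYERS = """
CREATE TABLE IF NOT EXISTS players (
    id   INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(191) NOT NULL
) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
"""
```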

When I store the string from above and select it inside MySQL Workbench, it is displayed like this:

蛙喜鄉民CLUB

When I read this string from the database back into Python (and store it in db_str), I get:

>>> db_str
èåéæ°CLUB
>>> db_str.encode('UTF-8')
b'\xc3\xa8\xc2\x9b\xc2\x99\xc3\xa5\xc2\x96\xc2\x9c\xc3\xa9\xc2\x84\xc2\x89\xc3\xa6\xc2\xb0\xc2\x91CLUB'

The first output is total gibberish; the second one, encoded as UTF-8, looks mostly like the encoded string from above, but with an extra \xc2 or \xc3 byte inserted before each original non-ASCII byte.

How can I save such strings into MySQL so that I can read them back and display them correctly inside a Python script?

Is my database collation utf8mb4_unicode_ci unsuitable for such content? Or do I have to use another encoding?

Jack O'Neill
  • I don't know which mysql library you're using, but I'm willing to bet that the answer is just to save the strings as strings, instead of encoding them. What's probably happening is that you pass the UTF-8-encoded bytes to the library, which makes a guess and decodes them as Latin-1 or cp1252 or similar to Unicode, and then stores the resulting mojibake as UTF-8. When you fetch it—whether in your workbench or in Python—it gets decoded back from UTF-8, faithfully restoring the same useless mojibake you stored. – abarnert Aug 24 '18 at 20:40
  • In general, "looks a lot like UTF-8 but with extra characters ranging from `\xc2` to `\xc8` before each non-ASCII character" is worth learning to recognize, because that's what you get when you encode to UTF-8, decode as a Latin-1-compatible character set, then encode to UTF-8 again. – abarnert Aug 24 '18 at 20:44
  • thanks for your input. I didn't consider the problem was on the library side and thought I did some other stuff wrong. I use the MySQLdb library and after adding `charset` to the connection object the encoded strings are now written and also returned correctly after that. – Jack O'Neill Aug 24 '18 at 21:30
  • MySQLdb is an abandoned project that just barely works with Python 3 and MySQL 5.5, so I'm not surprised that it has issues with this stuff. Is there a reason you're not using one of the supported, modern libraries (I think that's PyMySQL, mysqlclient, CMySQL, and MySQL Connector/Python)? – abarnert Aug 24 '18 at 21:45
  • See Mojibake in https://stackoverflow.com/questions/38363566/trouble-with-utf8-characters-what-i-see-is-not-what-i-stored Your first hex string is correct utf8mb4; the second is "double-encoding". – Rick James Aug 26 '18 at 00:48
  • sorry, I use the mysqlclient library, not the old MySQLdb one. mysqlclient uses the same package name, which led to the confusion before. – Jack O'Neill Aug 27 '18 at 14:31
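The double-encoding described in the comments above can be reproduced in a few lines of plain Python, with no database involved; the sample name is the one from the question:

```python
# Reproduce the mojibake: encode to UTF-8, mis-decode as Latin-1,
# then encode to UTF-8 again -- the round trip the driver performed.
sample = '蛙喜鄉民CLUB'

utf8_bytes = sample.encode('utf-8')       # what the script handed over
mojibake = utf8_bytes.decode('latin-1')   # the driver's wrong guess
double_encoded = mojibake.encode('utf-8') # the bytes actually stored

# Every original byte >= 0x80 now occupies two bytes, prefixed with
# \xc2 or \xc3 -- exactly the pattern shown in the question.
print(double_encoded)
# b'\xc3\xa8\xc2\x9b\xc2\x99\xc3\xa5\xc2\x96\xc2\x9c\xc3\xa9\xc2\x84\xc2\x89\xc3\xa6\xc2\xb0\xc2\x91CLUB'

# The damage is lossless, so it can be undone by reversing the steps.
repaired = double_encoded.decode('utf-8').encode('latin-1').decode('utf-8')
print(repaired)  # 蛙喜鄉民CLUB
```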

1 Answer


As @abarnert explained in a comment on the question, the problem was that the library used for writing the Unicode strings didn't know that UTF-8 should be used and therefore encoded the strings incorrectly.

After adding charset='utf8mb4' as a parameter to the MySQL connection, the strings get written correctly in the intended encoding.

All I had to change was

conn = MySQLdb.connect(host, user, passwd, db, port)

to

conn = MySQLdb.connect(host, user, passwd, db, port, charset='utf8mb4')

and after that my approach described in the question worked flawlessly.

Edit: after declaring the charset='utf8mb4' parameter on the connection object, it is no longer necessary to encode the strings manually, as the mysqlclient library now does that automatically.
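A minimal sketch of the resulting flow, wrapped in a function; the host, credentials, and `players` table are hypothetical placeholders, not values from the question:

```python
def store_and_fetch_name(name):
    """Insert a player name and read it back, letting the driver
    handle all encoding and decoding.

    Assumptions: a local MySQL server, placeholder credentials, and a
    `players` table with a `name` column (all hypothetical).
    """
    import MySQLdb  # provided by the mysqlclient package

    conn = MySQLdb.connect(host='localhost', user='game', passwd='secret',
                           db='game', charset='utf8mb4')
    try:
        cur = conn.cursor()
        # Pass the plain str -- no manual .encode('utf-8') needed;
        # the driver encodes it as utf8mb4 on the way out.
        cur.execute("INSERT INTO players (name) VALUES (%s)", (name,))
        conn.commit()
        cur.execute("SELECT name FROM players ORDER BY id DESC LIMIT 1")
        return cur.fetchone()[0]  # comes back as a str, already decoded
    finally:
        conn.close()
```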

Jack O'Neill
  • While this should work (and apparently does), I don't think you actually need this, and it's not what I was suggesting. Instead of encoding your data to UTF-8 bytes and telling the connector to expect UTF-8 bytes, what happens if you just send the data as Unicode strings in the first place, without calling `encode`? Maybe that won't work with ancient MySQLdb, but I'd at least try it before making things more complicated. – abarnert Aug 24 '18 at 21:48
  • When I try to persist the strings without encoding and without the connection parameter, I get gibberish characters in the database. I tried again without the `encode('utf-8')` call on the string but with the `charset='utf8mb4'` parameter on the connection, and that works as well. So just declaring which charset to use on the connection is sufficient; encoding is not required. Thanks for that, it makes things a lot simpler. – Jack O'Neill Aug 25 '18 at 14:25
  • Great! You should accept your own answer, so anyone else who has the same problem in the future will find your question and answer and know how to solve it. – abarnert Aug 25 '18 at 18:38