unicode values not being encoded correctly

Question

I'm connecting to a MySQL db from Python 3.5 and querying for some VARCHAR values that contain unicode characters. The behavior seems to have changed when we switched from Python 2.7 to 3.x.

When connecting, I specify charset "utf8" and use_unicode "True". This yields query results that are string objects, but with the wrong unicode characters.

HOWEVER! I can simulate the old behavior by setting use_unicode to "False" and then calling "decode('utf-8')" on the byte sequences returned by the query.

Also, when I connect via the command line client and select values manually, the results are the same "mangled" values the Python script generates when use_unicode is set to "True".

As an example, here's a "correct" value that I get when use_unicode is False and I decode the byte sequence myself in Python:

Hēroïne

...and here is the string object I get back when use_unicode is True:

HÄroÃ¯ne

Any thoughts? Or direction where I should start looking?

EDITED TO ADD:

I can get the command line client to display the values correctly by setting character_set_results to "latin1". When character_set_results is "utf8" I get the mangled version. This is weird, though, because the table and column I'm querying are both configured to use "utf8" as their character set. The database itself, however, looks like it's latin1:

mysql> show variables like 'character_set%';
+--------------------------+-------------------------------------------+
| Variable_name            | Value                                     |
+--------------------------+-------------------------------------------+
| character_set_client     | utf8                                      |
| character_set_connection | utf8                                      |
| character_set_database   | latin1                                    |
| character_set_filesystem | binary                                    |
| character_set_results    | latin1                                    |
| character_set_server     | latin1                                    |
| character_set_system     | utf8                                      |
| character_sets_dir       | /rdsdbbin/mysql-5.6.41.R4/share/charsets/ |
+--------------------------+-------------------------------------------+
8 rows in set (0.04 sec)

mysql> select title from site where title like 'Maison H%';
+------------------+
| title            |
+------------------+
| Maison Hēroïne   |
+------------------+
1 row in set (0.21 sec)

mysql> show create table site;
| site  | CREATE TABLE `site` (
...
  `title` varchar(255) CHARACTER SET utf8 NOT NULL DEFAULT '',
...
) ENGINE=InnoDB AUTO_INCREMENT=524892 DEFAULT CHARSET=utf8 COLLATE=utf8_bin |

You have "Mojibake"; see that in https://stackoverflow.com/questions/38363566/trouble-with-utf-8-characters-what-i-see-is-not-what-i-stored — Rick James, Nov 01 '19 at 15:58
I'm not sure it's an exact duplicate, since my question is how to achieve the prior Python 2.7 behavior under Python 3.5 without modifying the database. So far I've tried mysqlclient and mysql-connector-python and they behave the same. The auto-negotiated session settings cause the adapter to return incorrect str objects. Manually setting character_set_result to latin1 and then decoding the byte sequences in python works, but that's unwieldy. I was hoping to find a way to achieve the desired behavior by passing certain values to connect(). — jph, Nov 02 '19 at 16:25
I added debug to the pure python Oracle adapter. When I "set names 'latin1'" I get back valid utf8 bytes. However, the adapter then tries to decode them as 'latin1', which generates an incorrect string. I think the issue is that my table/column are utf8 but the server/database are both latin1. When character_set_result is set to something other than latin1, some conversion is happening on the server side that's mangling the bytes. — jph, Nov 05 '19 at 18:29
Let's see `SELECT col, HEX(col) ...` to see if it was stored wrong. One of the scenarios is effectively "two wrongs make a right"; it sounds like you are there. — Rick James, Nov 05 '19 at 21:50
Am coming back to this super late, but here's what worked as a temp fix without doing anything special to the adapter config: CONVERT(CONVERT(CONVERT(column USING latin1) USING binary) USING utf8) The HEX() of the column (without any conversion) was: 4D6169736F6E2048C384E2809C726FC383C2AF6E65 — jph, Jan 24 '20 at 17:12
That's "double encoding" of `Maison Hēroïne`. This is perhaps the "simplest" fix: `CONVERT(BINARY(CONVERT(CONVERT(UNHEX('4D6169736F6E2048C384E2809C726FC383C2AF6E65') USING utf8mb4) USING latin1)) USING utf8mb4)` -->`Maison Hēroïne` — Rick James, Jan 24 '20 at 18:27
Is that preferable to CONVERT(CONVERT(CONVERT(column USING latin1) USING binary) USING utf8)? We're currently recommending to the db aministrator to change the default system encoding to utf8. That addressed a similar issue on another MySQL db, but I suppose it may or may not be relevant here. — jph, Jan 26 '20 at 22:35
Plan A: Reload the data correctly. Plan B: try to recover the messed up data with that convert. (You would still need to change the settings so that future data would note be messed up.) — Rick James, Jan 26 '20 at 22:54
Ah, so changing the system default won't be a magic fix. Bummer. However, this data is updated relatively frequently, so if we have them change the default then new data, when written, may be correct? — jph, Jan 26 '20 at 23:36
_If_ you get all the settings correct, then all future actions will be "correct". Check the things in the 'dup' Q&A. If necessary, start a new question with the details of all the settings (`SHOW..`, connection params, `CREATE ..`, etc, and `HEX(col)`, etc). We can go through things methodically. This Comment list is getting too long. — Rick James, Jan 27 '20 at 01:59

unicode values not being encoded correctly

0 Answers0