
We're running Etherpad Lite and we're trying to migrate database from MySQL to PostgreSQL.

The MySQL database's `value` column uses the utf8mb4 character set. However, around 10% of all rows contain values that are in fact encoded in Windows-1252 or ISO-8859-15 instead of UTF-8. How is this possible? Doesn't MySQL validate UTF-8 before storing it in the column?

PostgreSQL cannot accept the invalid values during migration because it does validate the data, and it rejects e.g. the raw byte 0xE4 (ISO-8859-15: ä), which should be encoded as the byte sequence 0xC3 0xA4 in UTF-8.
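PostgreSQL's validation behaviour can be reproduced in a few lines of Python (a sketch, not Etherpad or PostgreSQL code): decoding the lone byte 0xE4 as UTF-8 fails, while the proper two-byte sequence succeeds.

```python
# A lone 0xE4 is "ä" in ISO-8859-15/Windows-1252, but it is not valid
# UTF-8 on its own, so a validating consumer such as PostgreSQL rejects it.
try:
    b"\xe4".decode("utf-8")
    is_valid_utf8 = True
except UnicodeDecodeError:
    is_valid_utf8 = False

print(is_valid_utf8)                 # prints "False"
print(b"\xc3\xa4".decode("utf-8"))   # the proper UTF-8 encoding: prints "ä"
```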

Is this a known "feature" of MySQL? Is there any way to always get real UTF-8 from utf8mb4 column?
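One workaround during migration is a best-effort decoder: try UTF-8 first and fall back to Windows-1252 for rows that were stored with the wrong encoding. This is only a sketch of the idea (the function name `to_utf8` is hypothetical, not part of any migration tool):

```python
def to_utf8(raw: bytes) -> str:
    """Best-effort decode: try UTF-8 first, fall back to Windows-1252."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Windows-1252 assigns a character to almost every byte value,
        # so this fallback always yields *some* string.
        return raw.decode("windows-1252", errors="replace")

print(to_utf8(b"\xc3\xa4"))  # valid UTF-8: prints "ä"
print(to_utf8(b"\xe4"))      # Windows-1252 fallback: also prints "ä"
```

Note that a whole-value fallback mangles rows that mix both encodings within a single value; those would need byte-range repair instead.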

Mikko Rantalainen

2 Answers


If

  • you say the client is using latin1 (etc), and
  • you say the column is utf8 (or utf8mb4), and
  • you provide hex E4

Then all is well. The E4 will be converted during the INSERT into C3A4 and that is what is stored. Do SELECT HEX(...) ... to verify.
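This conversion can be simulated outside MySQL (a Python sketch of the charset conversion MySQL performs on INSERT, not MySQL code itself):

```python
# Client declares latin1, column is utf8mb4: MySQL decodes the incoming
# bytes as latin1 and re-encodes them as UTF-8 before storing them.
client_bytes = b"\xe4"  # "ä" as a latin1 client sends it
stored = client_bytes.decode("latin-1").encode("utf-8")
print(stored.hex())     # what SELECT HEX(value) would show: prints "c3a4"
```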

If

  • you say the client is using utf8 (or utf8mb4), and
  • you say the column is utf8 (or utf8mb4), and
  • you provide hex C3A4

Again, all is well. The C3A4 goes directly into the table.

Here's a messy case:

If

  • you say the client is using latin1, and
  • you say the column is utf8 (or utf8mb4), and
  • but you provide hex C3A4

Then MySQL is obligated to convert the two latin1 characters (C3 and A4) into utf8, yielding C383C2A4. I call this "double encoding".
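The double encoding is easy to reproduce in Python (a sketch of the same conversion MySQL performs, with hypothetical variable names):

```python
# The client already holds valid UTF-8 (C3 A4 for "ä") but declared latin1,
# so MySQL treats C3 and A4 as two separate latin1 characters (Ã and ¤)
# and encodes each of them as UTF-8 on its own.
client_bytes = b"\xc3\xa4"
double_encoded = client_bytes.decode("latin-1").encode("utf-8")
print(double_encoded.hex())  # prints "c383c2a4"
```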

Follow the Best Practice in "Trouble with UTF-8 characters; what I see is not what I stored" and use its suggested way to test the data. Then come back with more details.

Probably the only way for 10% of the data to be mis-interpreted is for 10% of the data to be encoded differently. So, please provide hex for a 10% example and for a 90% example. And provide the hex in the client before inserting and in the table after it is inserted.
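On the MySQL side, `SELECT HEX(value) ...` prints the stored bytes; on the client side the same inspection can be done before inserting, for example like this (a generic sketch, not Etherpad code):

```python
import binascii

# Dump the raw bytes a client is about to send, so they can be compared
# against SELECT HEX(value) after the INSERT.
for sample in (b"\xe4", b"\xc3\xa4"):
    print(binascii.hexlify(sample).decode("ascii"))
# prints:
# e4
# c3a4
```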

Rick James
  • I have a case where the client says `set names utf8` and the database column is of type `utf8mb4`, and `SELECT` still returns strings containing the raw byte `\xE4` in place of `ä` or U+00E4. Note that UTF-8 should encode that character as `\xC3\xA4`. The data is originally input by a node.js service, in case that matters. – Mikko Rantalainen Sep 12 '17 at 07:41
  • Then you have "lied" about the bytes being utf8. Hex `E4` is latin1. `U+00E4` may look similar, but that is _unicode_, not latin1, nor utf8. Saying `set names utf8` is saying that the _client_ will have hex `C3A4` for `ä`. – Rick James Sep 12 '17 at 17:10
  • I agree that a plain `0xE4` should not appear in a UTF-8 string. However, Etherpad Lite (running on node.js) will sometimes emit raw Windows-1252 characters in the middle of UTF-8 strings (I think this is a bug in node.js or Etherpad Lite) and MySQL will happily store such a string in the MyISAM table. If MySQL had thrown an exception instead of accepting such an invalid string, I would not be asking this question. – Mikko Rantalainen Sep 13 '17 at 11:45
  • A UTF-8 string may include the byte `0xE4` if it's followed by two continuation bytes (bytes with the high bit set). For example, U+4494 is the byte sequence `0xE4 0x92 0x94`. – Mikko Rantalainen Sep 13 '17 at 11:51
  • I have no idea how to reproduce the problem (insert the data to database) but the fact is that MySQL is returning part of the data from database with Windows-1252 encoding even though the connection is set to UTF-8 and database column type is `utf8mb4`. – Mikko Rantalainen Sep 13 '17 at 12:00
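The byte-level claims in this comment thread are easy to verify in Python (a quick sketch):

```python
# 0xE4 is a legal UTF-8 *lead* byte when followed by two continuation
# bytes, as with U+4494...
assert "\u4494".encode("utf-8") == b"\xe4\x92\x94"

# ...but a lone 0xE4 followed by an ASCII byte is invalid UTF-8.
try:
    b"\xe4x".decode("utf-8")
    lone_e4_is_valid = True
except UnicodeDecodeError:
    lone_e4_is_valid = False
print(lone_e4_is_valid)  # prints "False"
```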

No solution is known. This is probably a bug in MySQL, which should disallow storing non-UTF-8 data when both the client connection and the column type are utf8mb4.

I no longer use MySQL for anything, so I haven't bothered to investigate this bug any further. Nowadays I'm using PostgreSQL for everything instead.

Mikko Rantalainen