How to clean data with special characters in MySQL

Question

How can one clean data that looks like this RÃƒÂ©ation, lÃ¢Â€Â™Oreal to look like this R'action and L'Oreal respectively in MySQL?

Thanks @Madhur. This is what i was looking for but somehow was not using the right words — tom sawyer, Aug 16 '18 at 12:48

Rick James · Accepted Answer · 2018-08-25T19:10:04.127

That looks like an example of "double encoding". It is where the right hand was talking utf8, but the left hand was listening for latin1. See Trouble with UTF-8 characters; what I see is not what I stored and See also http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases .

RÃƒÂ©ation -> Réation after undoing the double-encoding.

Yet you say R'action -- I wonder if you were typing é as e' or 'e?? I'm also going to assume you meant L’Oreal?? (Note the 'right single quote mark' instead of 'apostrophe'.)

First, we need to verify that it is actually an ordinary double-encoding.

SELECT col, HEX(col) FROM ... WHERE ...

should give you this for the hex for Réation:

52  E9       6174696F6E  -- latin1 encoding
52 C3A9      6174696F6E  -- utf8 encoding
52 C383 C2A9 6174696F6E  -- double encoding

(Ignore the spacing.)

If you got the third of those proceed with my Answer. If you get anything else, STOP! -- the problem is more complex than I thought.

Now, see if the double-encoding fix will fix it (before fixing it):

SELECT col, CONVERT(BINARY(CONVERT(CONVERT(
                  BINARY(CONVERT(col USING latin1)) USING utf8mb4)
                      USING latin1)) USING utf8mb4)
     FROM tbl;

You need to prevent it from happening and fix the data. Some of the following is irreversible; test it on a copy of the table!

Your case is: CHARACTER SET latin1, but have utf8/utf8mb4 bytes in it; leave bytes alone while fixing charset:

First, let's assume you have this declaration for tbl.col:

col VARCHAR(111) CHARACTER SET latin1 NOT NULL

Then to convert the column without changing the bytes:

ALTER TABLE tbl MODIFY COLUMN col VARBINARY(111) NOT NULL;
ALTER TABLE tbl MODIFY COLUMN col VARCHAR(111) CHARACTER SET utf8mb4 NOT NULL;

Note: If you start with TEXT, use BLOB as the intermediate definition. (Be sure to keep the other specifications the same - VARCHAR, NOT NULL, etc.)

Do that for each column in each table with the problem.

(In this discussion I don't distinguish between utf8mb4 and utf8. Most text is quite happy with either; Emoji and some Chinese need utf8mb4, not just utf8.)

from Comment

        CONVERT(UNHEX('C38EC2B2') USING utf8mb4) = 'β'  (Greek beta)
CONVERT(CONVERT(UNHEX('C38EC2B2') USING latin1) USING utf8mb4) = 'ÃŽÂ²'

My conclusion: First you had some misconfiguration. Then you applied one or more wrong fixes. You now have such a mess that I dare not try to help you unravel it. That is, the mess is on beyond simply "double encoding".

If possible, start over, being sure that some test data gets stored correctly before adding more data. If the data is bad not try to fix the data; back off and start over again. See the "best bractice" in "Trouble..." for getting set up correctly. I'll be around to help you interpret whether the hex you see in the tables is correct.

Thanks @Rick James.I have followed your instructions and can say my problem is double encoding.I got `52C383C2A964616374696F6E3A` after running `SELECT col, HEX(col) FROM ...` . The double encoding fix fixes the problem. However, i get lost on how to fix the data already inside the table. I altered the concerned table as suggested but and this solves the problem only when insering new double encoded text.My problem is to update the table with double encoded text so that text that appears like this `RÃ©daction` in the table changes to this `Rédaction`. — tom sawyer, Aug 25 '18 at 11:37
I have tried updating using this `update test set words = select words,convert(binary(convert(convert( binary(convert(words using latin1)) using utf8mb4) using latin1)) using utf8mb4) from test;` but i get `ERROR 1064 (42000)` — tom sawyer, Aug 25 '18 at 11:39
@tomsawyer - I need to think about whether you have a worse mess now -- I think you have to completely clean up the table schema and data before adding any new rows. Maybe it would help if you spelled out the steps you took and the current HEX. Please try to use copy & paste; retyping leads to miscopying, as I found out (React vs Redact, etc). — Rick James, Aug 25 '18 at 14:21
You are right, i actually have a far much worse mess. For example, i have rows with text that looks like this `seniorsÃ¢Â€Â™` which is should appear like this `seniors’` as well as this `RARÃŽÂ²` which should appear like this `RARβ`. When i run the update query you have suggested above the second word i.e.,`RARÃŽÂ²` i get ERROR 1366 (HY000): Incorrect string value: '\xCE\xB2', while `seniorsÃ¢Â€Â™`updates to NULL. In summary, my table contains far more obscure characters than quotations. — tom sawyer, Aug 25 '18 at 15:17
I spent years getting this far into being able to quickly deduce 5 different (and somewhat common) ways that utf8 gets garbled. Yours is well beyond those 5. I don't have years to spend debugging it. The first 5 come from (in general) doing one thing wrong in setting up utf8. I fear that doing _two_ things _sequentially_ wrong will lead to 5*5 or 5^5 cases. Yikes! (I added to my Answer.) — Rick James, Aug 25 '18 at 19:14
As you may have deduced, I have a tool that takes hex or utf8 and recursively experiments -- enumerating the few dozen things that can be mangled (or unmangled) from it. So, I can very quickly discover that `C38EC2B2` is double-encoding for `β` (and vice versa). And going a different direction can lead to `ÃŽÂ²`. But going between `ÃŽÂ²` and `β` is more complex than my tool can handle. — Rick James, Aug 25 '18 at 19:21
I will go by your advice of staring over again with test data to see if data gets inserted in the right way. Thank you for your help and also for offering to help in future — tom sawyer, Aug 25 '18 at 21:30

How to clean data with special characters in MySQL

1 Answers1