64

A previous LOAD DATA INFILE was run under the assumption that the CSV file was latin1-encoded. During this import the multibyte characters were interpreted as two single characters and then encoded using utf8 (again).

This double-encoding created anomalies like Ã± instead of ñ.

How to correct these strings?
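To make the mechanism concrete, here is a small demonstration (a sketch; the bytes are given as hex literals so it runs regardless of connection charset):

-- 'ñ' is U+00F1, stored in the CSV as the UTF-8 bytes C3 B1.
-- Read as latin1, those bytes become the two characters 'Ã' and '±',
-- which are then re-encoded to UTF-8 as C3 83 C2 B1.
SELECT CONVERT(CAST(0xC3B1 AS CHAR CHARACTER SET latin1) USING utf8) AS doubled,           -- 'Ã±'
       HEX(CONVERT(CAST(0xC3B1 AS CHAR CHARACTER SET latin1) USING utf8)) AS doubled_hex;  -- 'C383C2B1'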

vbence
  • @Esailija It is not a MySQL function. It can be solved without bringing tools like PHP into the picture. (The question was created to be self-answered, but if a better solution comes up I will accept it instead of mine). – vbence Jul 11 '12 at 15:59
  • Good to know; I'll mark this as a favorite so I can find it when I'm going to need it. – Puggan Se Jul 11 '12 at 16:02

4 Answers

123

The following MySQL expression returns the correct utf8 string after double-encoding:

CONVERT(CAST(CONVERT(field USING latin1) AS BINARY) USING utf8)

It can be used with an UPDATE statement to correct the fields:

UPDATE tablename SET
    field = CONVERT(CAST(CONVERT(field USING latin1) AS BINARY) USING utf8);
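
To preview the effect before changing any data, the same expression can be run in a SELECT first (a sketch using the same hypothetical table and column names):

SELECT field,
       CONVERT(CAST(CONVERT(field USING latin1) AS BINARY) USING utf8) AS fixed
FROM tablename
LIMIT 20;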
vbence
    I got this working mostly, but found a sequence that does not work: letter `ě` is `C49B` but appears in my database as `c384c29b` and `SELECT HEX(CONVERT(CAST(0xc384c29b AS CHAR) USING latin1))` got the invalid UTF-8 byte sequence `C43F` which means your outermost `CONVERT` does not work. UTF-8 bytes `c29b` should be Unicode `9B` but MySQL is setting it to `3F` (`?`) presumably because this is a control character in latin1. Perl's utf8::decode worked with it though. – hood Nov 11 '14 at 05:00
    I can't refrain from saying how happy I am I've found this solution :) – Attila Fulop Oct 05 '15 at 14:51
  • Great. This seems to work. Now my challenge is to work out whether only *some* records have this encoding error, or all of them. I might need to only search for records with incorrect characters. – Simon East Jul 15 '16 at 07:34
  • Wow... I was trying to accomplish this via PHP, but you saved my time. Worked 100% in my database. – Abbas Oct 14 '16 at 12:10
  • Thank you! Everyone should see [Eric’s enhancement](https://stackoverflow.com/a/38877330/89818) which prevents data loss due to `null` results (which you could get if a specific column was already non-ASCII `utf8`, i.e. not `latin1`). – caw Jan 16 '19 at 03:52
  • If your table has correct UTF8 entries mixed in, this will fail on those rows with a 1300 error, "Invalid utf8 character string". You can make the update row by row and skip over the rows that throw this error. – Dean Or Feb 14 '19 at 22:07
  • Truncation can possibly occur, especially if the string is long, so to be safe you can do `UPDATE table SET title = @txt WHERE char_length(title) = LENGTH(@txt := CONVERT(BINARY CONVERT(title USING latin1) USING utf8));` Found [here](https://coderwall.com/p/gjyuwg/mysql-convert-encoding-to-utf8-without-garbled-data). This does mean you will have lingering rows that aren't converted, however. – Dean Or Feb 14 '19 at 23:39
  • @DeanOr I don't think the parsed UTF8 string can be any longer than the bit stream stored as individual characters. If anything it can become shorter. – vbence Feb 15 '19 at 12:49
  • @vbence You are right; the test case I was running inserted values as latin1 encoding, which were longer than the column could hold and became truncated at that point, not at conversion. – Dean Or Feb 22 '19 at 20:06
  • I have a solution for the 1300 error: adding an `IGNORE` clause. But this results in truncated text; for example, `¡Felicidades` will become `Felicidades`. Basically this string is already UTF8, and converting to latin1 and back to utf8 causes data loss. You can add the following condition to skip over the rows whose raw bytes would be changed: `HEX(CONVERT(title USING latin1)) = HEX(CONVERT(CONVERT(CONVERT(title USING latin1) USING BINARY) USING utf8))` (combined into a full statement after these comments). – Dean Or Feb 25 '19 at 18:53
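
Combining the safeguards from the comments above, a more defensive variant of the UPDATE could look like this (a sketch; the WHERE clause is the byte-comparison test from the last comment and only touches rows whose latin1 byte sequence is itself valid UTF-8, i.e. rows that really are double-encoded):

UPDATE tablename SET
    field = CONVERT(CAST(CONVERT(field USING latin1) AS BINARY) USING utf8)
WHERE HEX(CONVERT(field USING latin1))
    = HEX(CONVERT(CONVERT(CONVERT(field USING latin1) USING BINARY) USING utf8));
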
19

The answer above worked for some of my data, but resulted in a lot of NULL values after running. My thinking is that if the conversion isn't successful, it returns NULL. To avoid that, I added a small check.

UPDATE tbl
SET col =
    CASE
        WHEN CONVERT(CAST(CONVERT(col USING latin1) AS BINARY) USING utf8) IS NULL THEN col
        ELSE CONVERT(CAST(CONVERT(col USING latin1) AS BINARY) USING utf8)
    END;
Eric
    Saved my life! Perfect!! Thanks! – Toastor Apr 23 '17 at 17:48
  • A really important addition to prevent data loss, thanks! You could simplify this further by replacing `CASE WHEN converted IS NULL THEN original ELSE converted END` with `IF(converted IS NULL, original, converted)` or even `IFNULL(converted, original)` (sketched below). – caw Jan 16 '19 at 03:58
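
A sketch of the IFNULL variant suggested in the comment above, using the same hypothetical table and column names:

UPDATE tbl SET
    col = IFNULL(CONVERT(CAST(CONVERT(col USING latin1) AS BINARY) USING utf8), col);
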
5

It is very important to use "utf8mb4" instead of "utf8", since MySQL will strip out all the data after an unrecognized character. So the safer method is:

UPDATE tablename SET
field = CONVERT(CAST(CONVERT(field USING latin1) AS BINARY) USING utf8mb4);

Be careful about this.
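
A small illustration of the difference (a sketch; the exact behaviour of the failing case varies across MySQL versions and sql_mode settings):

-- 0xF09F9880 is the 4-byte UTF-8 encoding of U+1F600 (an emoji)
SELECT CONVERT(0xF09F9880 USING utf8mb4);  -- the character survives: utf8mb4 supports 4-byte sequences
SELECT CONVERT(0xF09F9880 USING utf8);     -- empty or truncated result: utf8 handles at most 3-byte sequences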

burkul
  • Also, if your collation is not utf8mb4_unicode_ci, first change it (spelled out with a concrete type below): ALTER TABLE `tablename` MODIFY `fieldname` type CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci – burkul Nov 05 '19 at 09:26
  • This is an old thread, but I had exactly the same problem and your answer saved my day! Thank you so much! – Sudhir Jul 28 '22 at 06:44
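
The ALTER TABLE from the comment, spelled out with a concrete (hypothetical) column type:

ALTER TABLE `tablename`
    MODIFY `fieldname` VARCHAR(255)
    CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
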
3

I met this issue too. Here is a solution for Oracle:

update tablename t set t.colname = convert(t.colname, 'WE8ISO8859P1', 'UTF8') where t.colname like '%Ã%'

And another one for Java:

import java.io.UnsupportedEncodingException;
import java.util.regex.Pattern;

public static String fixDoubleEncoded(String text) {
    // A double-encoded sequence shows up as 'Ã' followed by a non-alphanumeric character
    final Pattern pattern = Pattern.compile("^.*Ã[^0-9a-zA-Z\\ \t].*$");
    try {
        // Undo one layer of mis-encoding per iteration: write the string's
        // characters out as ISO-8859-1 bytes, then decode those bytes as UTF-8
        while (pattern.matcher(text).matches())
            text = new String(text.getBytes("iso-8859-1"), "utf-8");
    }
    catch (UnsupportedEncodingException e) {
        e.printStackTrace();
    }
    return text;
}
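
A quick sanity check of the method (hypothetical input; `Ã±` is the double-encoded form of `ñ`):

String fixed = fixDoubleEncoded("Ã±");  // yields "ñ"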
Stéphane Millien