How to correct double-encoded UTF-8 strings sitting in MySQL utf8_general_ci fields?

Question

I have to redesign a class where (amongst other things) UTF-8 strings are double-encoded wrongly:

$string = iconv('ISO-8859-1', 'UTF-8', $string);
:
$string = utf8_encode($string);

These faulty strings have been saved into multiple table fields all over a MySQL database. All fields being affected use collation utf8_general_ci.

Usually I'd setup a little PHP patch script, looping thru the affected tables, SELECTing the records, correct the faulty records by using utf8_decode() on the double-encoded fields and UPDATE them.

As I got many and huge tables this time, and the error only affects german umlauts (äöüßÄÖÜ), I'm wondering if there's a solution smarter/faster than that.

Are pure MySQL solutions like the following safe and recommendable?

 UPDATE `table` SET `col` = REPLACE(`col`, 'Ã¤', 'ä');

Any other solutions/best practices?

score 21 · Accepted Answer · answered May 10 '11 at 15:05

21

Alter the table to change the column character set to Latin-1. You will now have singly-encoded UTF-8 strings, but sitting in a field whose collation is supposed to be Latin-1.

What you do then is, change the column character set back to UTF-8 via the binary character set - that way MySQL doesn't convert the characters at any point.

ALTER TABLE MyTable MODIFY MyColumn ... CHARACTER SET latin1
ALTER TABLE MyTable MODIFY MyColumn ... CHARACTER SET binary
ALTER TABLE MyTable MODIFY MyColumn ... CHARACTER SET utf8

(is the correct syntax iirc; put the appropriate column type in where ... is)

answered May 10 '11 at 15:05

Hammerite

21,755
6
70
91

D'oh, .. this one literally `hammered` me out of my chair (pun intended^^)! Thank you for pulling this out of your magic bag. It worked like a charm and disburdened me of any worries about "out of memory" issues. I only had to use `CHANGE` instead of `MODIFY`. Btw, I noticed that switching to charset `latin1` implicitely switched collation to `latin1_swedish_ci`. Would it be more safe to append `COLLATE latin1_swedish_ci` to force this? Same goes for `binary` (switched collation to `none`) and `utf8` (switched collation back to `utf8_general_ci`). Thank you again. You saved my night^^ – Jürgen Thelen May 10 '11 at 22:01
I think each time you change the character set, if you don't specify a collation then the collation will be the default collation for the new character set. The collation shouldn't matter for the first two changes, because you are only temporarily using that character set, and the collation has no bearing AFAIK on how characters are stored in the table or which characters are stored. Only the encoding matters in those two changes. So you should only ever need to specify a collation for the final change, the one where you change it back to UTF-8. But if I am wrong I welcome being corrected. – Hammerite May 11 '11 at 11:02
Thank you for further elaborating on this. Found a page in the MySQL documentation confirming your thoughts on a [default collation](http://dev.mysql.com/doc/refman/5.6/en/charset-table.html) fallback. Forcing a collation on the last switch only makes sense to me, too. Will do so. – Jürgen Thelen May 11 '11 at 12:37
Skip the first `ALTER` -- it is unnecessary and possibly will damage the data. – Rick James May 13 '15 at 21:03
If you skip the first ALTER command then the commands won't have any net effect. You will have just tied the MySQL server up for a little while making a change to a column definition and then changing it straight back again. Could you elaborate on your statement that the data might be "damaged"? What is the nature of the potential damage? – Hammerite May 13 '15 at 21:55

Owen Waring · Answer 2 · 2011-12-01T21:08:18.983

I tried the posted solutions, but my DB kept spitting up errors. Eventually I stumbled upon the following solution (in a forum I believe, but I can't remember where):

UPDATE table_name SET col_name = CONVERT(CONVERT(CONVERT(col_name USING latin1) USING binary) USING utf8);

and it worked a treat. Hope this helps anyone who stumbled here from desperate google searching like me.

NOTE: This is of course assuming your double encoded character issues originate from an overly helpful MySQL conversion from latin1 to utf8, but I believe that's where most of these "corrupted characters" happen. This basically does the same conversion as mentioned above back to latin1, then binary, then to utf8 (using the binary step as a way to prevent the re-encoding of the already encoded latin1 entities)

score 8 · Answer 3 · answered Feb 01 '12 at 01:49

8

I found the following approach simpler:

mysqldump -h DB_HOST -u DB_USER -p --skip-set-charset --default-character-set=latin1 DB_NAME > DB_NAME-dump.sql

Then drop all tables and re-import with following command:

mysql -h DB_HOST -u DB_USER -p --default-character-set=utf8 DB_NAME < DB_NAME-dump.sql

Tip was found at this URL: http://blog.hno3.org/2010/04/22/fixing-double-encoded-utf-8-data-in-mysql/

answered Feb 01 '12 at 01:49

Ross

81
1
1

This solution was more easy and much faster than the other solutions here. – coding Bott Aug 01 '13 at 09:01
This worked great... on double-encoded columns. It messed-up the properly-encoded columns I had though (don't ask), changing all Cyrillic to ???????????????????'s. Easy to fix by copying the newly recoded text back to the old table. Thanks. – ow3n Aug 19 '15 at 02:27

score 1 · Answer 4 · answered May 10 '11 at 14:50

MySql is charset aware, so you can convert in SQL. But for this case, I would probably prefer to just script it in PHP, as it's a one-off task anyway.

Keep in mind that columns in MySql have a charset property. The collation is (in theory) orthogonal to the charset. While a utf8_general_ci collation would imply that the charset is utf8, it's not a given. You could in theory mix a utf8 collation with a latin1 encoding (And get garbage as a result).

If you decide to do this in SQL, look here:

http://dev.mysql.com/doc/refman/5.0/en/charset-convert.html

To be honest, until you mentioned it in your 2nd paragraph, I always thought that collation `utf8_general_ci` __does__ mean charset `utf8` at the same time. +1 for making me read the MySQL basic documentation again after decades (more carefully this time^^). Big thank your for that. — Jürgen Thelen, May 10 '11 at 21:43

score 0 · Answer 5 · answered May 10 '11 at 14:48

0

MySQL provides a regexp match but no regexp replace, so you're usually better off iterating through each row in php, converting as needed, and updating the row if it has been changed.

answered May 10 '11 at 14:48

Denis de Bernardy

75,850
13
131
154

score 0 · Answer 6 · answered May 10 '11 at 15:11

0

Generate a dump using mysqldump, change the encoding declaration (it's in the first commands), and reload in another database.

You can also use iconv on your dump to transcode it.

You can SELECT INTO OUTFILE, massage the file using php or iconv, then LOAD DATA INFILE.

answered May 10 '11 at 15:11

bobflux

11,123
3
27
27

I thought about to `utf8_decode` a whole dump, but discarded it. Simply because some tables are way too huge (1G+ each), which would force me to read and convert the whole dump line by line. Thanks anyway. – Jürgen Thelen May 10 '11 at 21:32

How to correct double-encoded UTF-8 strings sitting in MySQL utf8_general_ci fields?

6 Answers6

Linked