14

I have to redesign a class where (amongst other things) UTF-8 strings are double-encoded wrongly:

$string = iconv('ISO-8859-1', 'UTF-8', $string);
:
$string = utf8_encode($string);

These faulty strings have been saved into multiple table fields all over a MySQL database. All fields being affected use collation utf8_general_ci.

Usually I'd setup a little PHP patch script, looping thru the affected tables, SELECTing the records, correct the faulty records by using utf8_decode() on the double-encoded fields and UPDATE them.

As I got many and huge tables this time, and the error only affects german umlauts (äöüßÄÖÜ), I'm wondering if there's a solution smarter/faster than that.

Are pure MySQL solutions like the following safe and recommendable?

 UPDATE `table` SET `col` = REPLACE(`col`, 'ä', 'ä');

Any other solutions/best practices?

Jürgen Thelen
  • 12,745
  • 7
  • 52
  • 71

6 Answers6

21

Alter the table to change the column character set to Latin-1. You will now have singly-encoded UTF-8 strings, but sitting in a field whose collation is supposed to be Latin-1.

What you do then is, change the column character set back to UTF-8 via the binary character set - that way MySQL doesn't convert the characters at any point.

ALTER TABLE MyTable MODIFY MyColumn ... CHARACTER SET latin1
ALTER TABLE MyTable MODIFY MyColumn ... CHARACTER SET binary
ALTER TABLE MyTable MODIFY MyColumn ... CHARACTER SET utf8

(is the correct syntax iirc; put the appropriate column type in where ... is)

Hammerite
  • 21,755
  • 6
  • 70
  • 91
  • D'oh, .. this one literally `hammered` me out of my chair (pun intended^^)! Thank you for pulling this out of your magic bag. It worked like a charm and disburdened me of any worries about "out of memory" issues. I only had to use `CHANGE` instead of `MODIFY`. Btw, I noticed that switching to charset `latin1` implicitely switched collation to `latin1_swedish_ci`. Would it be more safe to append `COLLATE latin1_swedish_ci` to force this? Same goes for `binary` (switched collation to `none`) and `utf8` (switched collation back to `utf8_general_ci`). Thank you again. You saved my night^^ – Jürgen Thelen May 10 '11 at 22:01
  • I think each time you change the character set, if you don't specify a collation then the collation will be the default collation for the new character set. The collation shouldn't matter for the first two changes, because you are only temporarily using that character set, and the collation has no bearing AFAIK on how characters are stored in the table or which characters are stored. Only the encoding matters in those two changes. So you should only ever need to specify a collation for the final change, the one where you change it back to UTF-8. But if I am wrong I welcome being corrected. – Hammerite May 11 '11 at 11:02
  • Thank you for further elaborating on this. Found a page in the MySQL documentation confirming your thoughts on a [default collation](http://dev.mysql.com/doc/refman/5.6/en/charset-table.html) fallback. Forcing a collation on the last switch only makes sense to me, too. Will do so. – Jürgen Thelen May 11 '11 at 12:37
  • Skip the first `ALTER` -- it is unnecessary and possibly will damage the data. – Rick James May 13 '15 at 21:03
  • If you skip the first ALTER command then the commands won't have any net effect. You will have just tied the MySQL server up for a little while making a change to a column definition and then changing it straight back again. Could you elaborate on your statement that the data might be "damaged"? What is the nature of the potential damage? – Hammerite May 13 '15 at 21:55
14

I tried the posted solutions, but my DB kept spitting up errors. Eventually I stumbled upon the following solution (in a forum I believe, but I can't remember where):

UPDATE table_name SET col_name = CONVERT(CONVERT(CONVERT(col_name USING latin1) USING binary) USING utf8);

and it worked a treat. Hope this helps anyone who stumbled here from desperate google searching like me.

NOTE: This is of course assuming your double encoded character issues originate from an overly helpful MySQL conversion from latin1 to utf8, but I believe that's where most of these "corrupted characters" happen. This basically does the same conversion as mentioned above back to latin1, then binary, then to utf8 (using the binary step as a way to prevent the re-encoding of the already encoded latin1 entities)

Owen Waring
  • 141
  • 1
  • 3
8

I found the following approach simpler:

mysqldump -h DB_HOST -u DB_USER -p --skip-set-charset --default-character-set=latin1 DB_NAME > DB_NAME-dump.sql

Then drop all tables and re-import with following command:

mysql -h DB_HOST -u DB_USER -p --default-character-set=utf8 DB_NAME < DB_NAME-dump.sql

Tip was found at this URL: http://blog.hno3.org/2010/04/22/fixing-double-encoded-utf-8-data-in-mysql/

Ross
  • 81
  • 1
  • 1
  • This solution was more easy and much faster than the other solutions here. – coding Bott Aug 01 '13 at 09:01
  • This worked great... on double-encoded columns. It messed-up the properly-encoded columns I had though (don't ask), changing all Cyrillic to ???????????????????'s. Easy to fix by copying the newly recoded text back to the old table. Thanks. – ow3n Aug 19 '15 at 02:27
1

MySql is charset aware, so you can convert in SQL. But for this case, I would probably prefer to just script it in PHP, as it's a one-off task anyway.

Keep in mind that columns in MySql have a charset property. The collation is (in theory) orthogonal to the charset. While a utf8_general_ci collation would imply that the charset is utf8, it's not a given. You could in theory mix a utf8 collation with a latin1 encoding (And get garbage as a result).

If you decide to do this in SQL, look here:

http://dev.mysql.com/doc/refman/5.0/en/charset-convert.html

troelskn
  • 115,121
  • 27
  • 131
  • 155
  • To be honest, until you mentioned it in your 2nd paragraph, I always thought that collation `utf8_general_ci` __does__ mean charset `utf8` at the same time. +1 for making me read the MySQL basic documentation again after decades (more carefully this time^^). Big thank your for that. – Jürgen Thelen May 10 '11 at 21:43
0

MySQL provides a regexp match but no regexp replace, so you're usually better off iterating through each row in php, converting as needed, and updating the row if it has been changed.

Denis de Bernardy
  • 75,850
  • 13
  • 131
  • 154
0

Generate a dump using mysqldump, change the encoding declaration (it's in the first commands), and reload in another database.

You can also use iconv on your dump to transcode it.

You can SELECT INTO OUTFILE, massage the file using php or iconv, then LOAD DATA INFILE.

bobflux
  • 11,123
  • 3
  • 27
  • 27
  • I thought about to `utf8_decode` a whole dump, but discarded it. Simply because some tables are way too huge (1G+ each), which would force me to read and convert the whole dump line by line. Thanks anyway. – Jürgen Thelen May 10 '11 at 21:32