49

I've got a database with a bunch of broken utf8 characters scattered across several tables. The list of characters isn't very extensive AFAIK (áéíúóÁÉÍÓÚÑñ)

Fixing a given table is very straightforward

update orderItem set itemName=replace(itemName,'Ã¡','á');

But I can't get a way of detecting the broken characters. If I do something like

SELECT * FROM TABLE WHERE field LIKE "%Ã%";

I get nearly all the fields because of the collation (Ã=a). All the broken characters so far start with an "Ã". The database is in Spanish, so this particular character isn't used.

The list of broken chars I've got so far is

Ã¡ = á
Ã© = é
Ã- = í
Ã³ = ó
Ã± = ñ
Ã (plus an invisible 0x81 byte) = Á

Any idea how to make this SELECT work as intended? (A binary search or something like that.)

The Disintegrator
  • Are you sure you’re using UTF-8? Those “broken characters” look as if you’re interpreting UTF-8 encoded data as ISO 8859-1, because the character `á` is encoded in UTF-8 as 0xC3A1, and that represents the characters `Ã` and `¡` in ISO 8859-1. – Gumbo Sep 25 '09 at 09:40
  • 7
    Don't look at me. My job is to fix this db. I didn't break it. – The Disintegrator Sep 26 '09 at 00:26
  • This is called "Mojibake" and is discussed [_here_](https://stackoverflow.com/questions/38363566/trouble-with-utf8-characters-what-i-see-is-not-what-i-stored) – Rick James Nov 29 '18 at 00:25
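
To see the mechanism Gumbo describes outside the database, here is a minimal Python sketch. Python is only used for illustration, and the helper name `misread_as_latin1` is made up. It encodes a character as UTF-8 and then wrongly reads those bytes back as ISO 8859-1, which reproduces the broken pairs listed in the question.

```python
def misread_as_latin1(text: str) -> str:
    """Encode text as UTF-8, then (wrongly) decode the bytes as ISO 8859-1."""
    return text.encode("utf-8").decode("latin-1")


# 'á' is two bytes in UTF-8: 0xC3 0xA1
utf8_bytes = "á".encode("utf-8")
print(utf8_bytes.hex())        # c3a1

# Read as ISO 8859-1, each byte becomes its own character.
print(misread_as_latin1("á"))  # two characters: U+00C3 (Ã) and U+00A1 (¡)
```

Every accented Latin letter from U+00C0 to U+00FF encodes to a first byte of 0xC3, which is why all the broken sequences start with "Ã".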

17 Answers

88

I fixed it with

UPDATE wp_zcs9ck_posts_copy SET post_title = 
    CONVERT(BINARY CONVERT(post_title USING latin1) USING utf8);

Complete solution: http://jonisalonen.com/2012/fixing-doubly-utf-8-encoded-text-in-mysql/

Thales Ceolin
  • 4
    Backup your database first... this deleted everything but the first half sentence in my wp_posts -> post_content field! – HandiworkNYC.com Mar 23 '16 at 15:16
  • Worked for me as well. Consider using `SELECT name, CONVERT(BINARY CONVERT(name USING latin1) USING utf8) AS conv FROM table WHERE id IN (SELECT id FROM table WHERE LENGTH(name) != CHAR_LENGTH(name));` to find broken records and see the result before using `UPDATE`. – Szymon Sadło Sep 18 '16 at 17:52
  • 3
    I made unnecessary subquery in the comment above. Use this `SELECT name, CONVERT(BINARY CONVERT(name USING latin1) USING utf8) AS conv FROM table WHERE LENGTH(name) != CHAR_LENGTH(name)` instead – Szymon Sadło Sep 18 '16 at 18:03
  • This answer applies to double-encoding, not simple Mojibake. – Rick James Nov 29 '18 at 00:28
  • I have found that checking that CONVERT(BINARY CONVERT(bundle_display_name USING latin1) USING utf8) is not NULL before doing the update helps prevent already correctly encoded values from getting broken (the test with length/char_length doesn't work well as "é" will be 3/1 but its correct value "é" is still 2/1) – GregOriol Oct 19 '20 at 12:57
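
The data loss reported in the comments comes from values that were already correct: pushing them through the latin1 round trip yields bytes that are not valid UTF-8. A rough Python sketch of a guarded repair follows. The function name `repair_or_keep` is hypothetical, and Python's `latin-1` codec is only an approximation of MySQL's latin1 (which is closer to cp1252), so treat this as an illustration of the idea rather than a drop-in equivalent.

```python
def repair_or_keep(value: str) -> str:
    """Undo one level of UTF-8-read-as-latin1 damage, or return value unchanged.

    The round trip fails (raises) for text that is already correct,
    so in that case the original value is kept instead of being truncated.
    """
    try:
        return value.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return value


broken = "\u00c3\u00a9tudiantes"   # the mis-decoded form of 'étudiantes'
print(repair_or_keep(broken))
print(repair_or_keep("Lapin de Pâques"))   # already correct, left alone
```

This mirrors the "only update when the converted value is not NULL" check suggested above. It cannot tell apart text that legitimately contains a sequence like "Ã" followed by "©", so previewing the result with a SELECT first is still worthwhile.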
41
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'Ã¡','á');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'Ã¤','ä');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'Ã©','é');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'í©','é');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'Ã³','ó');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'íº','ú');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'Ãº','ú');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'Ã±','ñ');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'í‘','Ñ');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'Ã','í');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'â€“','–');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name`,'â€™','\'');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name`,'â€¦','...');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name`,'â€“','-');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name`,'â€œ','"');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name`,'â€','"');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name`,'â€˜','\'');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name`,'â€¢','-');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name`,'‡','c');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'Â','');
hellodaniel
  • What about processing the .sql file with a scripting language like Perl and replacing all the "wrong" characters? In Perl it can easily be done with something like this: $content =~ s/Ã¡/á/g; Repeat the same statement for the rest of the characters. $content is a string storing the whole file. – Diego Pino Oct 31 '11 at 16:57
  • 1
    This worked for me, though I had a totally different character set, so had to do every character manually using Adam Lynch's trick(http://stackoverflow.com/a/6112241/807104), Only had to run queries 409 times, 'So I got that goin for me which is nice'. – Mohd Abdul Mujib Mar 26 '14 at 11:50
  • I've been trying to update a number of tables from other datasources for weeks using more comprehensive methods but this is by far the easiest. thanks! Was assembling this info myself today. – hellodaniel Feb 07 '15 at 05:38
22

Thanks for your answers!!

I fixed my tables with this, and wanted to share the full list of changes. Note that it also includes fixing html decoded characters, besides latin ones, it was really a mess:

(If you need more conversions, look them up at https://www.utf8-chartable.de/unicode-utf8-table.pl)

update `table` set `field` = replace(`field`, 'â€', '"');
update `table` set `field` = replace(`field`, 'â€“', '–');
update `table` set `field` = replace(`field`, 'â€¢', '-');
update `table` set `field` = replace(`field`, 'â€œ', '"');

update `table` set `field` = replace(`field`, 'Â¡', '¡');
update `table` set `field` = replace(`field`, 'Â¢', '¢');
update `table` set `field` = replace(`field`, 'Â£', '£');
update `table` set `field` = replace(`field`, 'Â¤', '¤');
update `table` set `field` = replace(`field`, 'Â¥', '¥');
update `table` set `field` = replace(`field`, 'Â¦', '¦');
update `table` set `field` = replace(`field`, 'Â§', '§');
update `table` set `field` = replace(`field`, 'Â¨', '¨');
update `table` set `field` = replace(`field`, 'Â©', '©');
update `table` set `field` = replace(`field`, 'Âª', 'ª');
update `table` set `field` = replace(`field`, 'Â«', '«');
update `table` set `field` = replace(`field`, 'Â¬', '¬');
# This one looks like it's missing a character, but it's there. 0xad
update `table` set `field` = replace(`field`, '­', '­');
update `table` set `field` = replace(`field`, 'Â®', '®');
update `table` set `field` = replace(`field`, 'Â¯', '¯');
update `table` set `field` = replace(`field`, 'Â°', '°');
update `table` set `field` = replace(`field`, 'Â±', '±');
update `table` set `field` = replace(`field`, 'Â²', '²');
update `table` set `field` = replace(`field`, 'Â³', '³');
update `table` set `field` = replace(`field`, 'Â´', '´');
update `table` set `field` = replace(`field`, 'Âµ', 'µ');
update `table` set `field` = replace(`field`, 'Â¶', '¶');
update `table` set `field` = replace(`field`, 'Â·', '·');
update `table` set `field` = replace(`field`, 'Â¸', '¸');
update `table` set `field` = replace(`field`, 'Â¹', '¹');
update `table` set `field` = replace(`field`, 'Âº', 'º');
update `table` set `field` = replace(`field`, 'Â»', '»');
update `table` set `field` = replace(`field`, 'Â¼', '¼');
update `table` set `field` = replace(`field`, 'Â½', '½');
update `table` set `field` = replace(`field`, 'Â¾', '¾');
update `table` set `field` = replace(`field`, 'Â¿', '¿');

update `table` set `field` = replace(`field`, 'Ã€', 'À');
# This one looks like it's missing a character, but it's there. 0x81
update `table` set `field` = replace(`field`, 'Ã', 'Á');
update `table` set `field` = replace(`field`, 'Ã‚', 'Â');
update `table` set `field` = replace(`field`, 'Ãƒ', 'Ã');
update `table` set `field` = replace(`field`, 'Ã„', 'Ä');
update `table` set `field` = replace(`field`, 'Ã…', 'Å');
update `table` set `field` = replace(`field`, 'Ã†', 'Æ');
update `table` set `field` = replace(`field`, 'Ã‡', 'Ç');
update `table` set `field` = replace(`field`, 'Ãˆ', 'È');
update `table` set `field` = replace(`field`, 'Ã‰', 'É');
update `table` set `field` = replace(`field`, 'ÃŠ', 'Ê');
update `table` set `field` = replace(`field`, 'Ã‹', 'Ë');
update `table` set `field` = replace(`field`, 'ÃŒ', 'Ì');
# This one looks like it's missing a character, but it's there. 0x8d
update `table` set `field` = replace(`field`, 'Ã', 'Í');
update `table` set `field` = replace(`field`, 'ÃŽ', 'Î');
# This one looks like it's missing a character, but it's there. 0x8f
update `table` set `field` = replace(`field`, 'Ã', 'Ï');
# This one looks like it's missing a character, but it's there. 0x90
update `table` set `field` = replace(`field`, 'Ã', 'Ð');
update `table` set `field` = replace(`field`, 'Ã‘', 'Ñ');
update `table` set `field` = replace(`field`, 'Ã’', 'Ò');
update `table` set `field` = replace(`field`, 'Ã“', 'Ó');
update `table` set `field` = replace(`field`, 'Ã”', 'Ô');
update `table` set `field` = replace(`field`, 'Ã•', 'Õ');
update `table` set `field` = replace(`field`, 'Ã–', 'Ö');
update `table` set `field` = replace(`field`, 'Ã—', '×');
update `table` set `field` = replace(`field`, 'Ã˜', 'Ø');
update `table` set `field` = replace(`field`, 'Ã™', 'Ù');
update `table` set `field` = replace(`field`, 'Ãš', 'Ú');
update `table` set `field` = replace(`field`, 'Ã›', 'Û');
update `table` set `field` = replace(`field`, 'Ãœ', 'Ü');
# This one looks like it's missing a character, but it's there. 0x9d
update `table` set `field` = replace(`field`, 'Ã', 'Ý');
update `table` set `field` = replace(`field`, 'Ãž', 'Þ');
update `table` set `field` = replace(`field`, 'ÃŸ', 'ß');
# The second character of this search string is a non-breaking space. 0xa0
update `table` set `field` = replace(`field`, 'Ã ', 'à');
update `table` set `field` = replace(`field`, 'Ã¡', 'á');
update `table` set `field` = replace(`field`, 'Ã¢', 'â');
update `table` set `field` = replace(`field`, 'Ã£', 'ã');
update `table` set `field` = replace(`field`, 'Ã¤', 'ä');
update `table` set `field` = replace(`field`, 'Ã¥', 'å');
update `table` set `field` = replace(`field`, 'Ã¦', 'æ');
update `table` set `field` = replace(`field`, 'Ã§', 'ç');
update `table` set `field` = replace(`field`, 'Ã¨', 'è');
update `table` set `field` = replace(`field`, 'Ã©', 'é');
update `table` set `field` = replace(`field`, 'Ãª', 'ê');
update `table` set `field` = replace(`field`, 'Ã«', 'ë');
update `table` set `field` = replace(`field`, 'Ã¬', 'ì');
# This one looks like it's missing a character, but it's there. 0xad
update `table` set `field` = replace(`field`, '­­Ã', 'í');
update `table` set `field` = replace(`field`, 'Ã®', 'î');
update `table` set `field` = replace(`field`, 'Ã¯', 'ï');
update `table` set `field` = replace(`field`, 'Ã°', 'ð');
update `table` set `field` = replace(`field`, 'Ã±', 'ñ');
update `table` set `field` = replace(`field`, 'Ã²', 'ò');
update `table` set `field` = replace(`field`, 'Ã³', 'ó');
update `table` set `field` = replace(`field`, 'Ã´', 'ô');
update `table` set `field` = replace(`field`, 'Ãµ', 'õ');
update `table` set `field` = replace(`field`, 'Ã¶', 'ö');
update `table` set `field` = replace(`field`, 'Ã·', '÷');
update `table` set `field` = replace(`field`, 'Ã¸', 'ø');
update `table` set `field` = replace(`field`, 'Ã¹', 'ù');
update `table` set `field` = replace(`field`, 'Ãº', 'ú');
update `table` set `field` = replace(`field`, 'Ã»', 'û');
update `table` set `field` = replace(`field`, 'Ã¼', 'ü');
update `table` set `field` = replace(`field`, 'Ã½', 'ý');
update `table` set `field` = replace(`field`, 'Ã¾', 'þ');
update `table` set `field` = replace(`field`, 'Ã¿', 'ÿ');

update `table` set `field` = replace(`field` ,'Ã§','ç');
update `table` set `field` = replace(`field` ,'Ã£','ã');
update `table` set `field` = replace(`field` ,'Ã¡','á');
update `table` set `field` = replace(`field` ,'Ã¢','â');
update `table` set `field` = replace(`field` ,'Ã©','é');
update `table` set `field` = replace(`field` ,'í','í');
update `table` set `field` = replace(`field` ,'Ãµ','õ');
update `table` set `field` = replace(`field` ,'Ãº','ú');
update `table` set `field` = replace(`field` ,'Ã§','ç');
update `table` set `field` = replace(`field` ,'Á','Á');
update `table` set `field` = replace(`field` ,'Ã‚','Â');
update `table` set `field` = replace(`field` ,'Ã‰','É');
update `table` set `field` = replace(`field` ,'Í','Í');
update `table` set `field` = replace(`field` ,'Ã•','Õ');
update `table` set `field` = replace(`field` ,'Ãš','Ú');
update `table` set `field` = replace(`field` ,'Ã‡','Ç');
update `table` set `field` = replace(`field` ,'Ãƒ','Ã');
update `table` set `field` = replace(`field` ,'Ã€','À');
update `table` set `field` = replace(`field` ,'ÃŠ','Ê');
update `table` set `field` = replace(`field` ,'Ã“','Ó');
update `table` set `field` = replace(`field` ,'Ã”','Ô');
update `table` set `field` = replace(`field` ,'Ãœ','Ü');
update `table` set `field` = replace(`field` ,'Ã£','ã');
# The second character of this search string is a non-breaking space. 0xa0
update `table` set `field` = replace(`field` ,'Ã ','à');
update `table` set `field` = replace(`field` ,'Ãª','ê');
update `table` set `field` = replace(`field` ,'Ã³','ó');
update `table` set `field` = replace(`field` ,'Ã´','ô');
update `table` set `field` = replace(`field` ,'Ã¼','ü');
update `table` set `field` = replace(`field` ,'&amp;','&');
update `table` set `field` = replace(`field` ,'&gt;','>');
update `table` set `field` = replace(`field` ,'&lt;','<');
update `table` set `field` = replace(`field` ,'&circ;','ˆ');
update `table` set `field` = replace(`field` ,'&tilde;','˜');
update `table` set `field` = replace(`field` ,'&uml;','¨');
update `table` set `field` = replace(`field` ,'&acute;','´');
update `table` set `field` = replace(`field` ,'&cedil;','¸');
update `table` set `field` = replace(`field` ,'&quot;','"');
update `table` set `field` = replace(`field` ,'&ldquo;','“');
update `table` set `field` = replace(`field` ,'&rdquo;','”');
update `table` set `field` = replace(`field` ,'&lsquo;','‘');
update `table` set `field` = replace(`field` ,'&rsquo;','’');
update `table` set `field` = replace(`field` ,'&lsaquo;','‹');
update `table` set `field` = replace(`field` ,'&rsaquo;','›');
update `table` set `field` = replace(`field` ,'&laquo;','«');
update `table` set `field` = replace(`field` ,'&raquo;','»');
update `table` set `field` = replace(`field` ,'&ordm;','º');
update `table` set `field` = replace(`field` ,'&ordf;','ª');
update `table` set `field` = replace(`field` ,'&ndash;','–');
update `table` set `field` = replace(`field` ,'&mdash;','—');
update `table` set `field` = replace(`field` ,'&macr;','¯');
update `table` set `field` = replace(`field` ,'&hellip;','…');
update `table` set `field` = replace(`field` ,'&brvbar;','¦');
update `table` set `field` = replace(`field` ,'&bull;','•');
update `table` set `field` = replace(`field` ,'&para;','¶');
update `table` set `field` = replace(`field` ,'&sect;','§');
update `table` set `field` = replace(`field` ,'&sup1;','¹');
update `table` set `field` = replace(`field` ,'&sup2;','²');
update `table` set `field` = replace(`field` ,'&sup3;','³');
update `table` set `field` = replace(`field` ,'&frac12;','½');
update `table` set `field` = replace(`field` ,'&frac14;','¼');
update `table` set `field` = replace(`field` ,'&frac34;','¾');
update `table` set `field` = replace(`field` ,'&#8539;','⅛');
update `table` set `field` = replace(`field` ,'&#8540;','⅜');
update `table` set `field` = replace(`field` ,'&#8541;','⅝');
update `table` set `field` = replace(`field` ,'&#8542;','⅞');
update `table` set `field` = replace(`field` ,'&gt;','>');
update `table` set `field` = replace(`field` ,'&lt;','<');
update `table` set `field` = replace(`field` ,'&plusmn;','±');
update `table` set `field` = replace(`field` ,'&minus;','−');
update `table` set `field` = replace(`field` ,'&times;','×');
update `table` set `field` = replace(`field` ,'&divide;','÷');
update `table` set `field` = replace(`field` ,'&lowast;','∗');
update `table` set `field` = replace(`field` ,'&frasl;','⁄');
update `table` set `field` = replace(`field` ,'&permil;','‰');
update `table` set `field` = replace(`field` ,'&int;','∫');
update `table` set `field` = replace(`field` ,'&sum;','∑');
update `table` set `field` = replace(`field` ,'&prod;','∏');
update `table` set `field` = replace(`field` ,'&radic;','√');
update `table` set `field` = replace(`field` ,'&infin;','∞');
update `table` set `field` = replace(`field` ,'&asymp;','≈');
update `table` set `field` = replace(`field` ,'&cong;','≅');
update `table` set `field` = replace(`field` ,'&prop;','∝');
update `table` set `field` = replace(`field` ,'&equiv;','≡');
update `table` set `field` = replace(`field` ,'&ne;','≠');
update `table` set `field` = replace(`field` ,'&le;','≤');
update `table` set `field` = replace(`field` ,'&ge;','≥');
update `table` set `field` = replace(`field` ,'&there4;','∴');
update `table` set `field` = replace(`field` ,'&sdot;','⋅');
update `table` set `field` = replace(`field` ,'&middot;','·');
update `table` set `field` = replace(`field` ,'&part;','∂');
update `table` set `field` = replace(`field` ,'&image;','ℑ');
update `table` set `field` = replace(`field` ,'&real;','ℜ');
update `table` set `field` = replace(`field` ,'&prime;','′');
update `table` set `field` = replace(`field` ,'&Prime;','″');
update `table` set `field` = replace(`field` ,'&deg;','°');
update `table` set `field` = replace(`field` ,'&ang;','∠');
update `table` set `field` = replace(`field` ,'&perp;','⊥');
update `table` set `field` = replace(`field` ,'&nabla;','∇');
update `table` set `field` = replace(`field` ,'&oplus;','⊕');
update `table` set `field` = replace(`field` ,'&otimes;','⊗');
update `table` set `field` = replace(`field` ,'&alefsym;','ℵ');
update `table` set `field` = replace(`field` ,'&oslash;','ø');
update `table` set `field` = replace(`field` ,'&Oslash;','Ø');
update `table` set `field` = replace(`field` ,'&isin;','∈');
update `table` set `field` = replace(`field` ,'&notin;','∉');
update `table` set `field` = replace(`field` ,'&cap;','∩');
update `table` set `field` = replace(`field` ,'&cup;','∪');
update `table` set `field` = replace(`field` ,'&sub;','⊂');
update `table` set `field` = replace(`field` ,'&sup;','⊃');
update `table` set `field` = replace(`field` ,'&sube;','⊆');
update `table` set `field` = replace(`field` ,'&supe;','⊇');
update `table` set `field` = replace(`field` ,'&exist;','∃');
update `table` set `field` = replace(`field` ,'&forall;','∀');
update `table` set `field` = replace(`field` ,'&empty;','∅');
update `table` set `field` = replace(`field` ,'&not;','¬');
update `table` set `field` = replace(`field` ,'&and;','∧');
update `table` set `field` = replace(`field` ,'&or;','∨');
update `table` set `field` = replace(`field` ,'&crarr;','↵');
Greg Schmidt
acseven
  • 1
    That was the only solution that worked for me! The binary thing didn't work, even in a SELECT. Thanks! – Lennon Jan 13 '14 at 13:25
  • 2
    This is a great list but the ordering is problematic because by the time the search for "â€¢" occurs, for example, the "â€" part at the beginning will have been replaced. – danielpunkass Feb 26 '21 at 16:41
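
One way around the ordering problem raised in the last comment is to apply the longest broken sequences first, so that a short prefix never eats the start of a longer match. A small Python sketch of that idea follows. The function name `apply_replacements` and the two-entry table are made up for the example, and the broken strings are written as escapes so that no invisible character gets lost.

```python
def apply_replacements(text: str, pairs: dict) -> str:
    """Apply every (broken -> fixed) pair, longest broken sequence first."""
    ordered = sorted(pairs.items(), key=lambda item: len(item[0]), reverse=True)
    for broken, fixed in ordered:
        text = text.replace(broken, fixed)
    return text


PAIRS = {
    "\u00e2\u20ac": '"',        # the two-character prefix shown as 'â€' in the list
    "\u00e2\u20ac\u00a2": "-",  # the three-character sequence for a bullet
}

sample = "\u00e2\u20ac\u00a2 item"
print(apply_replacements(sample, PAIRS))   # - item
```

In SQL the same effect is obtained by simply moving the statements for the shorter search strings to the end of the script.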
17

Text replacement is never a universal solution, because you can always miss some character. A more suitable fix for double-converted characters is:

  1. convert back to latin1
  2. convert to binary
  3. convert to utf8

Like this:

alter table descriptions modify name VARCHAR(2000) character set latin1;
alter table descriptions modify name blob;
alter table descriptions modify name VARCHAR(2000) character set utf8;
David
  • 1
    Though now I detected that it didn't work completely. Text that had a ' character was cut off at that character after the 3 alter table commands. :-( – e-motiv Oct 22 '12 at 18:42
  • This answer worked well for me. I added a simple script to my class collection, which will do this for all tables in a database. Instruction can be found here: http://www.os-cms.net/blog/view/34/Fix-utf8-broken-characters-in-MySQL Just if someone would find it useful. – dennis Oct 09 '15 at 13:37
  • This truncated column data wherever non-UTF-8 characters were present, and Chinese characters were replaced with question marks. Not a good solution if you don't want to lose data. – Satishakumar Awati Feb 14 '18 at 06:42
  • Shorter to do it in 2 steps: `ALTER TABLE tbl MODIFY COLUMN col VARBINARY(111) NOT NULL; ALTER TABLE tbl MODIFY COLUMN col VARCHAR(111) CHARACTER SET utf8mb4 NOT NULL;` – Rick James Nov 29 '18 at 00:30
  • Wow. This is awesome. Save me a lot of headache. Would be nice to have a script that does this to all varchar columns in all tables inside a db. – Kobulniczky Csongor Nov 04 '21 at 10:47
16

The SELECT statement you need is the following:

SELECT * FROM TABLE WHERE LENGTH(name) != CHAR_LENGTH(name);

This returns all rows which contain multi-byte characters.

name is assumed to be the field where the weird characters would be found.
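
For intuition, the LENGTH / CHAR_LENGTH comparison boils down to "byte count differs from character count". Here is a minimal Python sketch of the same test (the helper name `has_multibyte` is made up). Note that it flags correctly encoded accented text as well as broken text, so it narrows the search rather than proving damage.

```python
def has_multibyte(value: str) -> bool:
    """True when the UTF-8 byte length differs from the character count."""
    return len(value.encode("utf-8")) != len(value)


print(has_multibyte("cafe"))              # False: pure ASCII
print(has_multibyte("café"))              # True: correct, but multi-byte
print(has_multibyte("caf\u00c3\u00a9"))   # True: the broken form of 'café'
```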

Adam Lynch
15

This saved my life

UPDATE ohp_posts SET post_content = CONVERT(CAST(CONVERT(post_content USING latin1) AS BINARY) USING utf8)

I've found it here http://stanis.net/2014/04/replacing-latin-1-with-utf-8-characters-in-mysql/

Pablo S G Pacheco
10

How about a different approach, namely converting the column back and forth to get the correct character set? You can convert it to binary, then to utf-8 and then to iso-8859-1 or whatever else you're using. See the manual for the details.

wds
  • 1
    The idea is to end up with a UTF-8 encoded db. Right now the encoding and collation are utf8 general, but apparently the application that uses the db was interpreting it as ISO 8859. If I convert it back and forth, I will end up with the same data... – The Disintegrator Sep 26 '09 at 00:29
  • Well, just converting back and forth doesn't do magic; the characters are still broken. BUT in binary I can make a select looking for the Ã character. So now I have a mechanism to detect the broken chars. Thanks. – The Disintegrator Sep 26 '09 at 05:19
  • Okay, I remain convinced that there must be a way to use the conversion mechanism in a more general way but it might be more complex than first stated. Happy you found a solution that worked. – wds Sep 28 '09 at 08:30
  • If you mess up when altering a column type, you may lose data. With UPDATE queries, you can instead run them inside [a transaction](https://dev.mysql.com/doc/refman/5.7/en/commit.html), check the result, and commit only if it is OK, or roll back if not. – dolmen Dec 11 '17 at 17:58
7

To convert all the mis-encoded Latin characters to the correct accented ones, try this on MySQL:

UPDATE your_table SET your_column = CONVERT(CAST(CONVERT(your_column USING latin1) AS BINARY) USING utf8)
Medhi
3

I had this same problem but didn't like the replace() solution because there's always the possibility of missing some characters. I was working against a column with mixed data (some had been utf8_encode()d and some not) with 4 million or so rows, about 250k records with mis-encoded data (with Ã‰/etc characters), covering about 15 international languages, including mainly European languages but also Russian, Japanese and Chinese.

I started by duplicating the column, since I didn't want to lose any data:

ALTER TABLE images ADD COLUMN reptitle TEXT;

Copied all the data with multibyte characters (thanks Adam for the tip)

UPDATE images SET reptitle = title WHERE LENGTH(title) != CHAR_LENGTH(title)

Since reptitle was created with the table's default character set it was already utf8, but it contained the corrupted data, since the images table used to be a latin source. Column reptitle now contains some data which is correctly encoded and some which is corrupted (all values with multibyte characters; some had been correctly utf8_encode()d). So then with David's tip...

ALTER TABLE images MODIFY reptitle TEXT character set latin1;
ALTER TABLE images MODIFY reptitle BLOB;
ALTER TABLE images MODIFY reptitle TEXT character set utf8;

The middle step may not have been necessary since TEXT and BLOB (I think) are the same. This had the effect of correcting all incorrectly encoded data ('Ã©tudiantes' became 'étudiantes', etc) but data which was previously correct was truncated at the first multibyte character ('Lapin de Pâques' became 'Lapin de P'). I don't know why the truncation happens, but it's in a disposable column so I didn't care. The truncated data gives the same values for CHAR_LENGTH and LENGTH because there are no multi-byte characters remaining, so it's an easy query...

UPDATE images SET title = reptitle WHERE LENGTH(reptitle)!=CHAR_LENGTH(reptitle)

Then of course just drop the spare column

ALTER TABLE images DROP COLUMN reptitle

Also make sure (since I use PHP and this had tripped me up a couple of times so I thought I'd mention it here) all your script files are UTF8 (without BOM) and you are using:

mysql_set_charset('utf8', $connection);

Et voilà... perfectly repaired data, all languages :)

Bob Davies
  • Thanks for this - the method of adding a column and then copying them across is a good one, and laying it out like this is helpful. – cincodenada Nov 12 '13 at 21:48
2

In addition to Raúl Avila Solano's and acseven's answers, if you want to update all the broken characters in one query you can do:

update `table` set field = replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(field,'&uuml;','ü'),'&ocirc;','ô'),'&oacute;','ó'),'&ecirc;','ê'),'&agrave;','à'),'&atilde;','ã'),'&Uuml;','Ü'),'&Ocirc;','Ô'),'&Oacute;','Ó'),'&Ecirc;','Ê'),'&Agrave;','À'),'&Atilde;','Ã'),'&Ccedil;','Ç'),'&Uacute;','Ú'),'&Otilde;','Õ'),'&Iacute;','Í'),'&Iacute;','Í'),'&Eacute;','É'),'&Acirc;','Â'),'&Aacute;','Á'),'&ccedil;','ç'),'&uacute;','ú'),'&otilde;','õ'),'&iacute;','í'),'&eacute;','é'),'&acirc;','â'),'&aacute;','á'),'&atilde;','ã'),'&ccedil;','ç'),'à ','à'),'à ','à'),'º','º'),'ª','ª'),'ç','ç'),'–','–'),'ó','ó'),'é','é'),'á','á'),'ê','ê'),'ã','ã'),'â','â'),'í','í'),'õ','õ'),'Ø','Ø'),'•','-'),'ú','ú'),'à ','À'),'Ã','Ã'),'Ç','Ç'),'â€','"'),'“','"'),'É','É');
RafaSashi
2

This also solved my problem with some Italian chars:

UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'Ã¡','á');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'Ã¤','ä');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'Ã©','é');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'í©','é');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'Ã³','ó');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'íº','ú');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'Ãº','ú');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'Ã±','ñ');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'í‘','Ñ');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'Ã','í');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'â€“','–');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name`,'â€™','\'');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name`,'â€¦','...');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name`,'â€“','-');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name`,'â€œ','"');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name`,'â€','"');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name`,'â€˜','\'');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name`,'â€¢','-');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name`,'‡','c');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'Â','');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'í ','à');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'í¨','è');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'íˆ','È');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'â‚¬','€');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'eÌ€','è');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'í²','ò');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'í¹','ù');
Alberto
2

You might have both rows with properly encoded UTF-8 and rows with wrongly encoded characters. In this case "CONVERT(BINARY CONVERT(post_title USING latin1) USING utf8)" will trim some fields.

I ended up doing it this way

update `table` set `name` = replace(`name` ,CONVERT(BINARY "ä" USING latin1),'ä');
update `table` set `name` = replace(`name` ,CONVERT(BINARY "ö" USING latin1),'ö');
update `table` set `name` = replace(`name` ,CONVERT(BINARY "ü" USING latin1),'ü');
update `table` set `name` = replace(`name` ,CONVERT(BINARY "Ä" USING latin1),'Ä');
update `table` set `name` = replace(`name` ,CONVERT(BINARY "Ö" USING latin1),'Ö');
update `table` set `name` = replace(`name` ,CONVERT(BINARY "Ü" USING latin1),'Ü');
update `table` set `name` = replace(`name` ,CONVERT(BINARY "ß" USING latin1),'ß');
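
The statements above let MySQL compute the broken form of each character instead of typing it by hand. If you would rather generate such a list up front, a rough Python sketch follows. The helper names `broken_form` and `replace_statement` are made up, and `cp1252` is assumed to be a close enough stand-in for MySQL's latin1. Python's cp1252 codec has no mapping for the bytes 0x81, 0x8D, 0x8F, 0x90 and 0x9D, so characters such as Á or Í raise an error here and need separate handling.

```python
def broken_form(ch: str) -> str:
    """Return how ch looks after its UTF-8 bytes are misread as cp1252."""
    return ch.encode("utf-8").decode("cp1252")


def replace_statement(table: str, column: str, ch: str) -> str:
    """Build one REPLACE-based UPDATE for a single character."""
    return "update `{t}` set `{c}` = replace(`{c}`, '{b}', '{g}');".format(
        t=table, c=column, b=broken_form(ch), g=ch
    )


# Same characters as in the statements above; table and column are placeholders.
for ch in "äöüÄÖÜß":
    print(replace_statement("table", "name", ch))
```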
1

This is an extension of @Thales Ceolin's answer that generates the UPDATE statement for every text column of a table (drop the last condition to cover every table in the db):

select concat(
    "update ", 
    a.TABLE_NAME, 
    " set ", b.COLUMN_NAME, 
    " = CONVERT(BINARY CONVERT(", 
    b.COLUMN_NAME, 
    " USING latin1) USING utf8) where ",
    b.COLUMN_NAME, 
    " is not null;") query
from INFORMATION_SCHEMA.TABLES a
left join INFORMATION_SCHEMA.COLUMNS b on a.TABLE_SCHEMA = b.TABLE_SCHEMA and a.TABLE_NAME = b.TABLE_NAME
where a.table_schema = 'db_name'
and a.TABLE_TYPE = 'BASE TABLE'
and b.data_type in ('text', 'varchar')
and a.TABLE_NAME = 'table_name';

This will result in:

update table_name set idn = CONVERT(BINARY CONVERT(idn USING latin1) USING utf8) where idn is not null;
update table_name set name = CONVERT(BINARY CONVERT(name USING latin1) USING utf8) where name is not null;
update table_name set primary_last_name = CONVERT(BINARY CONVERT(primary_last_name USING latin1) USING utf8) where primary_last_name is not null;
manix
1

Based on the data in this post https://www.i18nqa.com/debug/utf8-debug.html I'd suggest this is a good query for identifying dodgy entries and their possible correct values:

SELECT my_field,CONVERT(BINARY CONVERT(my_field USING latin1) USING utf8mb4) AS new_field_value FROM my_table WHERE my_field REGEXP '[âÆËÅÂÃ]';

Be very careful because we had a bad encoding of a file name, but an OK encoding of the path, and in that case some of the solutions above would have caused a world of pain. If some of your data is already correctly encoded in UTF8 you'll likely find you lost a chunk of it.

PeteW
0

The middle step may not have been necessary since TEXT and BLOB are the same.

This had the effect of correcting all incorrectly encoded data but data which was previously correct was truncated at the first multibyte character.

Mihai Iorga
jams
0

As the main question was about detecting broken chars, here is my solution (it also prevents double-converting values that are already in the normal charset):

  1. Detection ( latin1 to utf8 )
SELECT name FROM %table% 
 WHERE 
CONVERT(CONVERT(name USING BINARY) USING utf8 ) != CONVERT(CONVERT(CONVERT(CONVERT(name USING BINARY) USING latin1) USING BINARY) USING utf8);
  2. Update (latin1 to utf8)
UPDATE %table% SET name = convert(cast(convert(name using latin1 ) as binary) using utf8 )
 WHERE 
CONVERT(CONVERT(name USING BINARY) USING utf8 ) != CONVERT(CONVERT(CONVERT(CONVERT(name USING BINARY) USING latin1) USING BINARY) USING utf8);
Pavel Budo
0

This query helped me identify rows that had bad characters in them. Basically you identify where the field isn't null, then convert to UTF8 and check for null after conversion.

select ach.*
from ach_warehouse ach
where addendum is not null and convert(addendum using utf8) is null;
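
The same "does it survive conversion to UTF-8" idea can be checked outside MySQL when you have the raw bytes, for example from a dump file. A minimal Python sketch, with the helper name `is_valid_utf8` made up for the example:

```python
def is_valid_utf8(raw: bytes) -> bool:
    """True when raw decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(is_valid_utf8(b"caf\xc3\xa9"))  # True: 'é' as two UTF-8 bytes
print(is_valid_utf8(b"caf\xe9"))      # False: 'é' as a single latin1 byte
```

This only catches byte sequences that are invalid UTF-8. Double-encoded text is still valid UTF-8, so it passes this check and needs one of the round-trip tests from the other answers.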
Uncle Iroh