How to replace broken German Umlauts with MySQL REPLACE function?

Question

(I accidentially asked this question first on DBA but found out in the FAQ to ask SQL questions here on SO instead)

I'm trying to "repair" a MySQL database (UTF-8 collation) that was imported from an encoding-broken SQL text file.

I.e. the German umlauts are broken inside the database. What should read e.g. "ü" is displayed as "Ã¼".

Since I'm aware of the REPLACE function, I first tried the obvious:

SELECT * FROM `mydb`.`mytable` WHERE `mycolumn` LIKE '%Ã¼%';

This gave me the expected results.

Next I tried to replace them:

UPDATE `mydb`.`mytable` SET `mycolumn` = REPLACE(`mycolumn`, 'Ã¼', 'ü');

To my surprise, this resulted in zero rows affected.

So even knowing a lot about encoding, I still cannot figure out how to solve this (or whether it is possible at all).

My question:

How to use the REPLACE function to replace broken German umlauts?

I assume you can't reimport with the proper encoding? What tool are you using to make the queries? — Pekka, Feb 25 '13 at 15:23
possible duplicate of [How to correct double-encoded UTF-8 strings sitting in MySQL utf8\_general\_ci fields?](http://stackoverflow.com/questions/5951871/how-to-correct-double-encoded-utf-8-strings-sitting-in-mysql-utf8-general-ci-fie) — Fabian Schmengler, Feb 25 '13 at 15:27
@fab This seems indeed to be the correct answer. Never thought of this from the title. Thank you! — Uwe Keim, Feb 25 '13 at 15:33
@The Closers: Wait another few minutes, seems that this does not yet work as expected :-) — Uwe Keim, Feb 25 '13 at 15:43
What's the matter? Have you tried Owen Waring's answer, too? — fancyPants, Feb 25 '13 at 16:28
@tombom Yes! If I try Owen Waring's answer in HeidiSQL, it says "0 records affected". I really do not understand this. — Uwe Keim, Feb 25 '13 at 16:46
I've tried it again, now it works for several records and then prints **SQL Error (1300): Invalid utf8 character string: 'FC7220'** — Uwe Keim, Feb 25 '13 at 17:03

score 3 · Accepted Answer · edited May 23 '17 at 10:27

To answer my own question, the linked possible duplicate pointed me into some direction but was not the right solution for me.

Instead, I did the following steps (using HeidiSQL):

Right-clicked the table in the tree.
Selected "Edit"
Changed the drop-down value of "Default collation" to "latin_german1_ci". Did not tick the "Convert data" checkbox.
Clicked the "Save" button at the bottom.
Changed the drop-down value of "Default collation" to "utf8_general_ci". Did not tick the "Convert data" checkbox.
Clicked the "Save" button at the bottom.

After that, I could successfully execute my statements like e.g.

UPDATE `mydb`.`mytable` SET `mycolumn` = REPLACE(`mycolumn`, 'Ã¼', 'ü');

brighty · Answer 2 · 2013-11-29T13:16:10.227

2

Your 'Umlauts' are NOT broken! These are just encoded as utf-8 and so your 'u with two dots' results in a utf-8 2-byte sequence. There is nothing wrong about it. Your problem is that your viewer possibly displays in ANSI encoding (iso-8859-1) which results in a 1:1 relation between byte and character. However a viewer must "decode" utf-8 sequences to get the codepoint for the character, otherwise it will simply display the utf-8 start-byte as well as following-byte/s as the character it represents in ANSI. I bet your view is not configured to view utf-8 encoded text. Just configure it as appropriate and you're done.

Let your 'u with two dots' be utf-8 encoded, then every viewer worldwide can decode it's codepoint which is a unique unicode codepoint. Remember that if your 'u with two dots' is ANSI encoded, it will show possibly a different character when displayed with a different ANSI encoding, for example iso-8859-5.

edited Nov 29 '13 at 13:16

answered Nov 29 '13 at 13:09

brighty

406
3
10

[I'm the guy linking to Joel's Unicode article](http://www.joelonsoftware.com/articles/Unicode.html) all the time, so I'm pretty sure that I know how this encoding stuff works ;-) – Uwe Keim Nov 29 '13 at 13:22
1

Thanks for the link to Joel's Unicode article, it is really well explained, especially when it comes to the utf-8 encoding part. I'd appreciate if the article would become completed by explaining how codepoints bigger than 0xFFFF (characters beyond the Basic Multilingual Plane BMP) fit into wchar-arrays e. g. how this is done in UTF-16 (the lovely surrogates). – brighty Dec 02 '13 at 11:08

Hamza Rashid · Answer 3 · 2018-04-05T08:37:01.313

IMHO, MySQL is correct in storing ü as Ã¼ using UTF-8 collation utf8_general_ci or better utf_unicode_ci. (Discussion on why to prefer utf8_unicode_ci)

Now, it matters how you decode your UTF-encoded SQL string before you print it on the frontend like HTML, Android or iOS.

HTML

In case of HTML, set utf-8 charset in <head> section of HTML page, German character ü will show correctly.

<head>
    <meta charset="UTF-8"/>
    ...
</head>

If doesn't, then in case of PHP wrap your variable in utf_decode() function before printing it.

Note: If you are not using PHP with MySQL, lookup for utf-8 decode function in your language of choice.

Android

In case of Android use

Html.fromHtml(String).toString();

iOS

In case of iOS use

(NSString *)stringByDecodingHTMLEntities;

How to replace broken German Umlauts with MySQL REPLACE function?

3 Answers3