2

(I accidentially asked this question first on DBA but found out in the FAQ to ask SQL questions here on SO instead)

I'm trying to "repair" a MySQL database (UTF-8 collation) that was imported from an encoding-broken SQL text file.

I.e. the German umlauts are broken inside the database. What should read e.g. "ü" is displayed as "ü".

Since I'm aware of the REPLACE function, I first tried the obvious:

SELECT * FROM `mydb`.`mytable` WHERE `mycolumn` LIKE '%ü%';

This gave me the expected results.

Next I tried to replace them:

UPDATE `mydb`.`mytable` SET `mycolumn` = REPLACE(`mycolumn`, 'ü', 'ü');

To my surprise, this resulted in zero rows affected.

So even knowing a lot about encoding, I still cannot figure out how to solve this (or whether it is possible at all).

My question:

How to use the REPLACE function to replace broken German umlauts?

Uwe Keim
  • 39,551
  • 56
  • 175
  • 291
  • 1
    I assume you can't reimport with the proper encoding? What tool are you using to make the queries? – Pekka Feb 25 '13 at 15:23
  • 1
    Can you re-insert the data with "SET NAMES utf8"? – zanlok Feb 25 '13 at 15:23
  • 1
    possible duplicate of [How to correct double-encoded UTF-8 strings sitting in MySQL utf8\_general\_ci fields?](http://stackoverflow.com/questions/5951871/how-to-correct-double-encoded-utf-8-strings-sitting-in-mysql-utf8-general-ci-fie) – Fabian Schmengler Feb 25 '13 at 15:27
  • @fab This seems indeed to be the correct answer. Never thought of this from the title. Thank you! – Uwe Keim Feb 25 '13 at 15:33
  • @The Closers: Wait another few minutes, seems that this does not yet work as expected :-) – Uwe Keim Feb 25 '13 at 15:43
  • What's the matter? Have you tried Owen Waring's answer, too? – fancyPants Feb 25 '13 at 16:28
  • @tombom Yes! If I try Owen Waring's answer in HeidiSQL, it says "0 records affected". I really do not understand this. – Uwe Keim Feb 25 '13 at 16:46
  • I've tried it again, now it works for several records and then prints **SQL Error (1300): Invalid utf8 character string: 'FC7220'** – Uwe Keim Feb 25 '13 at 17:03

3 Answers3

3

To answer my own question, the linked possible duplicate pointed me into some direction but was not the right solution for me.

Instead, I did the following steps (using HeidiSQL):

  1. Right-clicked the table in the tree.
  2. Selected "Edit"
  3. Changed the drop-down value of "Default collation" to "latin_german1_ci". Did not tick the "Convert data" checkbox.
  4. Clicked the "Save" button at the bottom.
  5. Changed the drop-down value of "Default collation" to "utf8_general_ci". Did not tick the "Convert data" checkbox.
  6. Clicked the "Save" button at the bottom.

After that, I could successfully execute my statements like e.g.

UPDATE `mydb`.`mytable` SET `mycolumn` = REPLACE(`mycolumn`, 'ü', 'ü');
Community
  • 1
  • 1
Uwe Keim
  • 39,551
  • 56
  • 175
  • 291
2

Your 'Umlauts' are NOT broken! These are just encoded as utf-8 and so your 'u with two dots' results in a utf-8 2-byte sequence. There is nothing wrong about it. Your problem is that your viewer possibly displays in ANSI encoding (iso-8859-1) which results in a 1:1 relation between byte and character. However a viewer must "decode" utf-8 sequences to get the codepoint for the character, otherwise it will simply display the utf-8 start-byte as well as following-byte/s as the character it represents in ANSI. I bet your view is not configured to view utf-8 encoded text. Just configure it as appropriate and you're done.

Let your 'u with two dots' be utf-8 encoded, then every viewer worldwide can decode it's codepoint which is a unique unicode codepoint. Remember that if your 'u with two dots' is ANSI encoded, it will show possibly a different character when displayed with a different ANSI encoding, for example iso-8859-5.

brighty
  • 406
  • 3
  • 10
  • [I'm the guy linking to Joel's Unicode article](http://www.joelonsoftware.com/articles/Unicode.html) all the time, so I'm pretty sure that I know how this encoding stuff works ;-) – Uwe Keim Nov 29 '13 at 13:22
  • 1
    Thanks for the link to Joel's Unicode article, it is really well explained, especially when it comes to the utf-8 encoding part. I'd appreciate if the article would become completed by explaining how codepoints bigger than 0xFFFF (characters beyond the Basic Multilingual Plane BMP) fit into wchar-arrays e. g. how this is done in UTF-16 (the lovely surrogates). – brighty Dec 02 '13 at 11:08
1

IMHO, MySQL is correct in storing ü as ü using UTF-8 collation utf8_general_ci or better utf_unicode_ci. (Discussion on why to prefer utf8_unicode_ci)

Now, it matters how you decode your UTF-encoded SQL string before you print it on the frontend like HTML, Android or iOS.

HTML

In case of HTML, set utf-8 charset in <head> section of HTML page, German character ü will show correctly.

<head>
    <meta charset="UTF-8"/>
    ...
</head>

If doesn't, then in case of PHP wrap your variable in utf_decode() function before printing it.

Note: If you are not using PHP with MySQL, lookup for utf-8 decode function in your language of choice.

Android

In case of Android use

Html.fromHtml(String).toString();

iOS

In case of iOS use

(NSString *)stringByDecodingHTMLEntities;
Hamza Rashid
  • 1,329
  • 15
  • 22