Are there hidden encoding errors that I need to fix in Latin 1 --> UTF-8?

Question

Do I still need to run a full latin1 to UTF 8 conversion on the text that looks completely fine?

I'm swapping forum software, and the old forum database used Latin1 encoding. The new forum database uses UTF8 encoding for tables.

It looks like the importer script did a straight copy from one table to another without trying to fix any encoding issues.

I've been manually fixing the visible errors using a find-and-replace based on the conversion info listed here: http://www.i18nqa.com/debug/utf8-debug.html

The rest of the text looks fine and is completely readable.

My limited understanding is that UTF-8 is backwards compatible with ASCII and Latin1 is mostly ASCII, so it's only the edge cases that are different and need to be updated.

So do I still need to run a full latin1 to UTF 8 conversion on the text that looks completely fine?

I'd rather not because I've changed some of the BB Code tags on a number of the fields after they were stored in UTF 8, so concerned that those updates would have stuck UTF8 characters in the middle of the Latin1 characters, and trying to do a full conversion on mixed character sets will just muck things up further.

Well, you're basically asking if cutting corners is fine here. The strict approach would be to decode from Latin1 and to encode to UTF-8 and only then to make changes to the content. Anything less than that has a risk of incorrect data. — Simeon Visser, Jan 23 '15 at 22:14
I wasn't sure because I thought UTF-8 is backwards compatible with ASCII and Latin1 is mostly ASCII, so it's only the edge cases that are different and need to be updated. — Jeff Widman, Jan 23 '15 at 22:23
It's best not to assume things about encodings and properties they have (think of them as black boxes). To convert A to B you decode from A and encode to B. — Simeon Visser, Jan 23 '15 at 22:32
possible duplicate of [How to detect UTF-8 characters in a Latin1 encoded column - MySQL](http://stackoverflow.com/questions/9304485/how-to-detect-utf-8-characters-in-a-latin1-encoded-column-mysql) — Jonathan Leffler, Jan 24 '15 at 01:36
Eh, the answer might be similar, but the question is different. I specifically read that question before I even asked this one. That one is about how to detect differences, this one is whether it's okay to take the shortcut... Functionally what I needed to know is "It's okay to take the shortcut as long as you convert all characters in this range." — Jeff Widman, Jan 24 '15 at 01:38

Jonathan Leffler · Accepted Answer · 2015-01-24T13:50:32.637

1

Any characters from ISO 8859-1 (Latin 1) in the range 0x80..0xFF need to be recoded as 2 bytes in UTF-8. The first byte is 0xC2 for 0x80..0xBF; the first byte is 0xC3 for 0xC0..0xFF. The second byte is derived from the original value from Latin 1 by setting the two most significant bits to 1 and 0. For the characters 0x80..0xBF, the value of the second byte is unchanged from Latin 1. If you were using 8859-15, you may have a few more complex conversions (the Euro symbol is encoded differently from other Latin 1 characters).

There are tools aplenty to assist. iconv is one such.

edited Jan 24 '15 at 13:50

answered Jan 23 '15 at 22:31

Jonathan Leffler

730,956
141
904
1,278

So if they are not in that range, they don't need to be converted? Can I find these characters using this query? `SELECT post_id, CONVERT(CONVERT(message USING BINARY) USING latin1) AS latin1, CONVERT(CONVERT(message USING BINARY) USING utf8) AS utf8 FROM xf_post WHERE CONVERT(message USING BINARY) RLIKE CONCAT('[', UNHEX('80'), '-', UNHEX('FF'), ']')` – Jeff Widman Jan 24 '15 at 01:27
Unicode code points U+0000 to U+007F are identical to the ISO 8859-x code points 0x00 to 0x7F, and are encoded as single bytes in the range 0x00..0x7F in UTF-8, so there is no need to be concerned about characters in the 'plain ASCII' range. The short answer to whether your query will work is "I don't know". The best I can say is "maybe" and "try it". Put some accented characters into a table in 8859-1, then try the query. If it works, the answer's yes. If not, no. – Jonathan Leffler Jan 24 '15 at 01:31
Thanks! Query pulled from here btw--seems to work in basic testing: http://stackoverflow.com/questions/9304485/how-to-detect-utf-8-characters-in-a-latin1-encoded-column-mysql?rq=1 – Jeff Widman Jan 24 '15 at 01:34
Looks like that might be an appropriate dup for this question; it seems to be tackling the same issue (migrating from Latin 1 to UTF-8 in MySQL). – Jonathan Leffler Jan 24 '15 at 01:36

Are there hidden encoding errors that I need to fix in Latin 1 --> UTF-8?

1 Answers1