I've been converting some large tables from latin1 to utf8 and found the same problem as this user. But the table I was converting from has the collation latin1_general_ci (or latin1_swedish_ci). So why does MySQL have different interpretations of "case-insensitive" in different character sets? Because latin1 does not treat o=ö or o=oe as equal, a unique latin1 index can hold thousands of values that clash under utf8.
2 Answers
There are two reasons:
Case is locale-dependent. Different locales can map the same character to different lower-case (or upper-case) characters. IIRC the Turkish I should have ı (U+0131 LATIN SMALL LETTER DOTLESS I) as its lower case. See e.g. the Unicode Casemap FAQ. So the _swedish_ is relevant.
Additionally, the generic Unicode algorithm is complex, and it maps Unicode strings to Unicode strings. Using it on other charsets could cause problems (an implementation would have to detect and handle separately the cases where the case-transformed result falls outside the original charset). Also, Unicode is "modern", so MySQL users do [did] not want MySQL to change string equality from one version to the next (e.g. from pre-Unicode to Unicode-as-first-class-charset [which, BTW, it is not yet]).
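
As a rough illustration of both points (this is plain Python, not MySQL code), Python's string methods apply the default, locale-independent Unicode case mappings. The helpers `describe` and `fits_in` are names made up for this sketch, and `cp1252` is used here only as a stand-in for MySQL's latin1.

```python
import unicodedata

def describe(s):
    """Return the Unicode names of the code points in s."""
    return [unicodedata.name(ch) for ch in s]

def fits_in(s, encoding):
    """Return True if every character of s can be encoded in the given charset."""
    try:
        s.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# Default mappings, with no Turkish tailoring applied.
print(describe("I".lower()))        # a Turkish-aware mapping would give dotless i here
print(describe("\u0131".upper()))   # dotless i upper-cases to plain I
print(describe("\u0130".lower()))   # capital I with dot lower-cases to two code points

# A case mapping can leave the original charset: the micro sign is in cp1252,
# but its default upper case is a Greek capital letter, which is not.
micro = "\u00b5"
print(fits_in(micro, "cp1252"), fits_in(micro.upper(), "cp1252"))
```

The last line is the situation the parenthesis above describes: the transformed string no longer fits in the charset the column uses.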

-
I agree there's a locale issue and I was surprised that using e.g. utf8_german2_ci didn't help (though the table in question strays outside this). But it seems the second part of your answer says that latin1 dates from before anyone thought about this and they didn't want to change it because it would maybe break lots of existing databases. Is there any better reason than that? – watergeus Apr 04 '18 at 14:25
-
I must test exactly what characters are interpreted differently. Unicode tried to have a more sensible algorithm, which can be used generically on all languages. Latin1 is much more oriented on some west European rules of collation. – Giacomo Catenazzi Apr 04 '18 at 14:34
-
I haven't done an exhaustive test but in general in latin1 'a'!='á', 'e'!='è', 'i'!='î', 'o'!='ö' etc. whereas in utf-8 those are all equal. – watergeus Apr 04 '18 at 14:43
-
And I was wrong. From the MySQL manual: "The xxx_general_mysql500_ci collations preserve the pre-5.1.24 ordering of the original xxx_general_ci collations and permit upgrades for tables created before MySQL 5.1.24". So stability was not a problem. – Giacomo Catenazzi Apr 04 '18 at 14:55
-
And I was also wrong that `utf8_general_ci` uses the Unicode algorithm (`utf8_unicode_ci` does). So it is purely a MySQL implementation detail. See https://dev.mysql.com/doc/refman/5.7/en/charset-unicode-sets.html for some description, but there is nothing about the reasoning behind latin1_general (https://dev.mysql.com/doc/refman/5.7/en/charset-we-sets.html). – Giacomo Catenazzi Apr 04 '18 at 15:04
-
Thanks Giacomo for following this up. But I can't find any xxx_general_mysql500_ci collations in my information_schema.collations table. It looks like I'll have to convert the table from scratch, folding the overlaps somehow. – watergeus Apr 04 '18 at 15:52
The collations with a language name or country code are tailored to that language. Swedish, for example, sorts Å (A-ring) after Z ("On beyond Zebra"?). Most other languages sort it identically to A.
Notice that there are several different latin1 collations, and lots of utf8 collations.
I captured the history of utf8_general_mysql500_ci and the issues with ß here.
MySQL's ...general... collations look at a single character at a time, thereby always treating 'oe' or 'ss' or 'll' as 2 letters. 'General' is faster, but rarely useful.
...bin just checks bytes. No case folding; no accent stripping.
MySQL ties together case folding and accent stripping in nearly all collations (...ci). There are only a few ...cs ('case sensitive') collations.
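
To make the difference concrete, here is a minimal Python sketch, assuming a crude per-character fold (strip accents, then lower-case) as a stand-in for a ...ci collation and a byte comparison as a stand-in for ...bin. All function names are hypothetical. MySQL's real collations use their own weight tables, so results will differ in detail (ß, for instance, is not handled here). The last helper groups key values that such a fold would treat as duplicates, which is the kind of clash a unique index hits after conversion.

```python
import unicodedata
from collections import defaultdict

def fold_char(ch):
    """Fold a single character: strip accents, then lower-case it."""
    decomposed = unicodedata.normalize("NFD", ch)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return base.lower() or ch  # keep the character if nothing is left

def fold(s):
    """Fold a string one character at a time, so 'oe' always stays two letters."""
    return "".join(fold_char(ch) for ch in s)

def eq_bin(a, b):
    """Byte-wise comparison: no case folding, no accent stripping."""
    return a.encode("utf-8") == b.encode("utf-8")

def eq_ci_like(a, b):
    """Case folding and accent stripping tied together, as in most ...ci collations."""
    return fold(a) == fold(b)

def find_clashes(keys):
    """Group keys by folded form; return only the groups with more than one key."""
    groups = defaultdict(list)
    for key in keys:
        groups[fold(key)].append(key)
    return {folded: members for folded, members in groups.items() if len(members) > 1}

print(eq_bin("a", "A"), eq_ci_like("a", "A"))
print(eq_ci_like("o", "ö"), eq_ci_like("oe", "ö"))
print(find_clashes(["Motor", "motör", "oe", "ö"]))
```

Running something like `find_clashes` over an export of the indexed column gives a list of values to merge or rename before the conversion, though the authoritative check is still the target collation in MySQL itself.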
To see what is equal or not in various utf8 collations: http://mysql.rjweb.org/utf8_collations.html
For utf8mb4 (MySQL 8.0): http://mysql.rjweb.org/utf8mb4_collations.html
