Which MySQL utf8 collation to use for data coming from latin1

Question

I have a lot of data in MySQL in latin1, and I'm trying to convert everything to utf8, but so far I haven't found a collation that converts it correctly. Some of the data are international names, with many accents and characters from various languages, and there is a primary key on these fields. I have a simple test case that goes like this:

CREATE TABLE utf8_test ( value varchar(30), PRIMARY KEY(value) ) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_swedish_ci;
INSERT INTO utf8_test VALUES ('e');
INSERT INTO utf8_test VALUES ('é');
INSERT INTO utf8_test VALUES ('è');
INSERT INTO utf8_test VALUES ('ü');
INSERT INTO utf8_test VALUES ('u');
INSERT INTO utf8_test VALUES ('y');
INSERT INTO utf8_test VALUES ('ÿ');

It's a simple test to demonstrate the problem, but the real data is not limited to this.

So far, only utf8_bin accepts everything without a duplicate-key error, but I can't use it because it's case-sensitive. Am I missing something?

Note: some of the tables have more than a dozen million rows, so performance is a factor.

`utf8_bin` *is* case sensitive - do you need case sensitivity? That is, do you need the collation to distinguish between 'A' and 'a', for example? — Mike, Jun 24 '11 at 14:20
My bad, I meant case-sensitive; corrected. (I want A=a when I do a search.) — anto1ne, Jun 24 '11 at 14:52

score 1 · Answer 1 · edited May 23 '17 at 11:48

The correct collation depends on more than the source character set (Latin-1). It depends on the language and locale that the data came from. If you're just worried about making sure that no characters are folded one into another, utf8_bin is probably the way to go.
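For the conversion itself, a minimal sketch could look like the following. The table name `legacy_names` is hypothetical, and the sketch assumes the latin1 column really holds latin1-encoded text (not UTF-8 bytes stored in a latin1 column, which would need a different procedure) and that the client sends UTF-8.

```sql
-- Hypothetical table standing in for an existing latin1 table.
SET NAMES utf8;

CREATE TABLE legacy_names (
  value VARCHAR(30) NOT NULL,
  PRIMARY KEY (value)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 COLLATE=latin1_bin;

INSERT INTO legacy_names VALUES ('e'), ('é'), ('E');

-- Converts every character column and the table default to utf8 with utf8_bin,
-- so values that were distinct under latin1_bin stay distinct after conversion.
ALTER TABLE legacy_names CONVERT TO CHARACTER SET utf8 COLLATE utf8_bin;

-- Inspect the resulting column and table definitions.
SHOW CREATE TABLE legacy_names;
```

The ALTER rebuilds the table, so on tables with millions of rows it is probably worth timing it on a copy first.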

You may experience unexpected sorting, though, where the characters are sorted according to the UTF-8 byte values, rather than according to any particular language or regional preferences.
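To see why the ordering looks odd, it can help to display the stored bytes next to each value. This is only a sketch: the table name `byte_order_demo` is made up for illustration, and it assumes the client connection sends UTF-8.

```sql
-- Hypothetical demo table; the column is explicitly collated as utf8_bin.
SET NAMES utf8;

CREATE TABLE byte_order_demo (
  value VARCHAR(30) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  PRIMARY KEY (value)
) ENGINE=InnoDB;

INSERT INTO byte_order_demo VALUES ('e'), ('é'), ('E'), ('Z');

-- HEX() exposes the stored UTF-8 bytes; under utf8_bin the sort follows them.
SELECT value, HEX(value) AS utf8_bytes
FROM byte_order_demo
ORDER BY value;
```

With these four values, 'E' (45) and 'Z' (5A) should come before 'e' (65), with 'é' (C3A9) last, because uppercase ASCII has lower byte values than lowercase and accented letters are encoded as multi-byte sequences.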

To get a case-insensitive collation, you may want to try utf8_general_ci or utf8_unicode_ci, as described in "UTF-8: General? Bin? Unicode?".

Edit: Further research suggests that there's currently no collation that you can use. See http://bugs.mysql.com/bug.php?id=19567.

How's this for an option? Assign utf8_bin to your column(s), but apply a different collation (one that folds case and accents) in your queries. You won't lose any data, and your queries will be case-insensitive. If accent-insensitive queries don't hurt you, it could be a compromise...
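A rough sketch of that idea, using a hypothetical table `names_demo` and utf8_general_ci as the query-time collation (any case-folding utf8 collation could be substituted). It assumes the client connection sends UTF-8.

```sql
-- Hypothetical demo table: the column is stored with utf8_bin, so
-- 'e', 'é' and 'E' are all distinct primary-key values.
SET NAMES utf8;

CREATE TABLE names_demo (
  value VARCHAR(30) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  PRIMARY KEY (value)
) ENGINE=InnoDB;

INSERT INTO names_demo VALUES ('e'), ('é'), ('E');

-- Exact match: uses the column's own utf8_bin collation.
SELECT value FROM names_demo WHERE value = 'e';

-- Case- and accent-insensitive match: the collation is overridden per query.
SELECT value FROM names_demo WHERE value COLLATE utf8_general_ci = 'e';
```

One caveat, as far as I know: a comparison that overrides the column's collation generally cannot use the index on that column, so on tables with millions of rows it is worth checking the plan with EXPLAIN before relying on this.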

I tried both, and they consider e=é and y=ÿ for example, causing duplicate keys. — anto1ne, Jun 24 '11 at 14:00
Oh. That's disappointing. Regrettably, I'm not a MySQL user myself, so I don't have anything else to suggest. — Blair Conrad, Jun 24 '11 at 20:37

score 1 · Answer 2 · answered Jun 24 '11 at 14:48

As far as I know, MySQL does not yet have case-sensitive utf8 collations, so utf8_bin is about your only choice. utf8_bin is case-sensitive and treats all diacritical marks as distinct. As Blair Conrad pointed out, it can make a bit of a mess of sorting. You may be able to solve that by using a COLLATE clause in your SQL statements, although if you have a lot of SQL statements, this approach could be problematic.

CREATE TABLE `utf8_test` (
  `value` varchar(30) COLLATE utf8_bin NOT NULL DEFAULT '',
  PRIMARY KEY (`value`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;

INSERT INTO utf8_test VALUES ('e');
INSERT INTO utf8_test VALUES ('é');
INSERT INTO utf8_test VALUES ('è');
INSERT INTO utf8_test VALUES ('ü');
INSERT INTO utf8_test VALUES ('u');
INSERT INTO utf8_test VALUES ('y');
INSERT INTO utf8_test VALUES ('ÿ');

INSERT INTO utf8_test VALUES ('E');
INSERT INTO utf8_test VALUES ('É');
INSERT INTO utf8_test VALUES ('È');
INSERT INTO utf8_test VALUES ('Ü');
INSERT INTO utf8_test VALUES ('U');
INSERT INTO utf8_test VALUES ('Y');
INSERT INTO utf8_test VALUES ('Ÿ');

SELECT value FROM utf8_test WHERE value = 'E';
+-------+
| value |
+-------+
| E     |
+-------+

SELECT value FROM utf8_test WHERE value COLLATE utf8_unicode_ci = 'E';

+-------+
| value |
+-------+
| E     |
| e     |
| È     |
| É     |
| è     |
| é     |
+-------+

SELECT value FROM utf8_test ORDER BY value;

+-------+
| value |
+-------+
| E     |
| U     |
| Y     |
| e     |
| u     |
| y     |
| È     |
| É     |
| Ü     |
| è     |
| é     |
| ü     |
| ÿ     |
| Ÿ     |
+-------+

SELECT value FROM utf8_test ORDER BY value COLLATE utf8_unicode_ci;

+-------+
| value |
+-------+
| E     |
| é     |
| è     |
| É     |
| È     |
| e     |
| u     |
| Ü     |
| U     |
| ü     |
| y     |
| Y     |
| ÿ     |
| Ÿ     |
+-------+
