mysql utf8mb4_unicode_ci cause unique key collision

Question

I have a table like this

CREATE TABLE `mb1` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `name` varchar(30) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `u_name` (`name`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

And I insert tow rows

insert into mb1(name) values('K'),('Ｋ');

Note, the second K is unicode character

+------+-----------+
| name | hex(name) |
+------+-----------+
| K    | 4B        |
| Ｋ   | EFBCAB    |
+------+-----------+

Why do they cause unique key collision? Aren't they different character in utf8mb4?

After remove COLLATE utf8mb4_unicode_ci, the problem disappeared.

If you want case folding, but accent sensitivity, please file a request at http://bugs.mysql.com . — Rick James, Mar 14 '17 at 22:33

score 4 · Accepted Answer · answered Aug 28 '15 at 04:03

Why do they cause unique key collision? Aren't they different character in utf8mb4?

You are missing the point about CHARACTER SET and COLLATION. A CHARACTER SET is a collection of different characters. A COLLATION says whether to treat the characters as equal -- think A and a -- different characters, but treated for ORDER BY and WHERE =, etc as being the same.

mysql> SELECT 'K'='Ｋ' COLLATE utf8_unicode_ci;
+-----------------------------------+
| 'K'='Ｋ' COLLATE utf8_unicode_ci  |
+-----------------------------------+
|                                 1 |
+-----------------------------------+

So in utf8_unicode_ci (or utf8mb4_unicode_ci), those two characters are considered to be "equal".

"Equal" is the test for UNIQUE keys.

Set the COLLATION for the column to whatever makes sense for you.

utf8mb4_unicode_ci for good 'real life' comparisons, apparently including this one. K=k=Ķ=ķ
utf8mb4_unicode_ci for more simple-minded comparisons. In particular no 2-character combinations match 1-character encodings. Case folding and accent stripping does occur. K=k=Ķ=ķ
utf8mb4_bin blindly checks the bits. No case folding, etc. K k Ķ ķ are all unequal.

utf8mb4_latvian_ci is a little different: K=k but not equal to Ķ=ķ . There are other specialized collations for other languages (mostly Western European).

Your Ｋ is called "FULLWIDTH LATIN CAPITAL LETTER K", so it is quite reasonable that it compare equal to Latin K.

mysql utf8mb4_unicode_ci cause unique key collision

1 Answers1

Linked