Please help me understand how multibyte characters like emoji are handled in MySQL utf8mb4 fields.

See below for a simple test SQL to illustrate the challenges.

/* Clear Previous Test */
DROP TABLE IF EXISTS `emoji_test`;
DROP TABLE IF EXISTS `emoji_test_with_unique_key`;

/* Build Schema */
CREATE TABLE `emoji_test` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `string` varchar(191) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL DEFAULT '',
  `status` tinyint(1) NOT NULL DEFAULT '1',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
CREATE TABLE `emoji_test_with_unique_key` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `string` varchar(191) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL DEFAULT '',
  `status` tinyint(1) NOT NULL DEFAULT '1',
  PRIMARY KEY (`id`),
  UNIQUE KEY `idx_string_status` (`string`,`status`) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

/* INSERT data */
# Expected Result is successful insert for each of these.
# However some fail. See comments.
INSERT INTO emoji_test (`string`, `status`) VALUES ('🌶', 1);                   # SUCCESS
INSERT INTO emoji_test (`string`, `status`) VALUES ('🌮', 1);                   # SUCCESS
INSERT INTO emoji_test (`string`, `status`) VALUES ('🌮🌶', 1);                 # SUCCESS
INSERT INTO emoji_test (`string`, `status`) VALUES ('🌶🌮', 1);                 # SUCCESS
INSERT INTO emoji_test_with_unique_key (`string`, `status`) VALUES ('🌶', 1);   # SUCCESS
INSERT INTO emoji_test_with_unique_key (`string`, `status`) VALUES ('🌮', 1);   # FAIL: Duplicate entry '?-1' for key 'idx_string_status'
INSERT INTO emoji_test_with_unique_key (`string`, `status`) VALUES ('🌮🌶', 1); # SUCCESS
INSERT INTO emoji_test_with_unique_key (`string`, `status`) VALUES ('🌶🌮', 1); # FAIL: Duplicate entry '??-1' for key 'idx_string_status'

/* Test data */

    /* Simple Table */
SELECT * FROM emoji_test WHERE `string` IN ('🌶','🌮','🌮🌶','🌶🌮'); # SUCCESS (all 4 are found)
SELECT * FROM emoji_test WHERE `string` IN ('🌶');                     # FAIL: Returns both 🌶 and 🌮
SELECT * FROM emoji_test WHERE `string` IN ('🌮');                     # FAIL: Returns both 🌶 and 🌮
SELECT * FROM emoji_test;                                              # SUCCESS (all 4 are found)

    /* Table with Unique Key */
SELECT * FROM emoji_test_with_unique_key WHERE `string` IN ('🌶','🌮','🌮🌶','🌶🌮'); # FAIL: Only 2 are found (due to insert errors above)
SELECT * FROM emoji_test_with_unique_key WHERE `string` IN ('🌶');                     # SUCCESS
SELECT * FROM emoji_test_with_unique_key WHERE `string` IN ('🌮');                     # FAIL: 🌶 found instead of 🌮
SELECT * FROM emoji_test_with_unique_key;                                              # FAIL: Only 2 records found (🌶 and 🌮🌶)

I'm interested in learning what causes the FAILs above and how I can get around this.

Specifically:

  1. Why do selects for one multibyte character return results for any multibyte character?
  2. How can I configure an index to handle multibyte characters instead of treating them as `?`?
  3. Can you recommend changes to the second CREATE TABLE (the one with a unique key) above in such a way that makes all the test queries return successfully?
Ryan
  • As any Mexican can tell you, 🌮 (['TACO' (U+1F32E)](http://www.fileformat.info/info/unicode/char/1f32e/index.htm)) and 🌶 (['HOT PEPPER' (U+1F336)](http://www.fileformat.info/info/unicode/char/1f336/index.htm)) are clearly related but different things. This must be the most wonderfully composed question in years. – Álvaro González Dec 14 '16 at 16:59
  • Related: http://stackoverflow.com/questions/38116984/finding-values-case-insensitively-with-emojis : *Solution is to use MySQL 5.6+ and to use utf8mb4_unicode_520_ci collation which doesn't treat all 4 bytes characters as equal* - A pretty good reason to avoid emojis as passwords :) – Álvaro González Dec 14 '16 at 17:28
  • @ÁlvaroGonzález Well, if this is a problem for passwords, then there is a bigger problem with the given setup, because passwords should be stored with a one-way hash. And for hashing, it _shouldn't_ be a problem. But I also wouldn't suggest using them for passwords. – t.niese Dec 14 '16 at 17:37

2 Answers

27

You use utf8mb4_unicode_ci for your columns, so comparisons are case-insensitive. If you use utf8mb4_bin instead, then the emoji 🌮 and 🌶 are correctly identified as different characters.

With WEIGHT_STRING you can get the values that are used for sorting and comparing the input string.

If you write:

SELECT
  WEIGHT_STRING ('🌮' COLLATE 'utf8mb4_unicode_ci'),
  WEIGHT_STRING ('🌶' COLLATE 'utf8mb4_unicode_ci')

Then you can see that both are 0xfffd. The MySQL manual's section on Unicode Character Sets says:

For supplementary characters in general collations, the weight is the weight for 0xfffd REPLACEMENT CHARACTER.
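
To see why these emoji count as supplementary characters, it can help to look at their encoded size. The following is a minimal sketch. It builds the taco from its UTF-8 bytes with UNHEX so that the result does not depend on the client's connection character set; that construction is an assumption made for this illustration and is not part of the original test.

```sql
-- U+1F32E (TACO) lies above U+FFFF, so it is a supplementary character
-- and occupies 4 bytes in utf8mb4 while still being a single character.
SELECT
  HEX(CONVERT(UNHEX('F09F8CAE') USING utf8mb4))         AS utf8_bytes,   -- F09F8CAE
  LENGTH(CONVERT(UNHEX('F09F8CAE') USING utf8mb4))      AS byte_length,  -- 4
  CHAR_LENGTH(CONVERT(UNHEX('F09F8CAE') USING utf8mb4)) AS char_length;  -- 1
```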

If you write:

SELECT
  WEIGHT_STRING('🌮' COLLATE 'utf8mb4_bin'),
  WEIGHT_STRING('🌶' COLLATE 'utf8mb4_bin')

You will get their Unicode code points, 0x01f32e and 0x01f336, instead.

Other letters such as Ä, Á and A, which are equal if you use utf8mb4_unicode_ci, can be inspected the same way:

SELECT
  WEIGHT_STRING ('Ä' COLLATE 'utf8mb4_unicode_ci'),
  WEIGHT_STRING ('A' COLLATE 'utf8mb4_unicode_ci')

Those map to the same weight, 0x0E33:

Ä: 00C4  ; [.0E33.0020.0008.0041][.0000.0047.0002.0308] # LATIN CAPITAL LETTER A WITH DIAERESIS; QQCM
A: 0041  ; [.0E33.0020.0008.0041] # LATIN CAPITAL LETTER A

According to "Difference between utf8mb4_unicode_ci and utf8mb4_unicode_520_ci collations in MariaDB/MySQL?", the weights used for utf8mb4_unicode_ci are based on UCA 4.0.0. Because the emoji do not appear in that version, the mapped weight is 0xfffd.
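
To check which Unicode collations a given server offers, a query along these lines can be used. It is only a sketch: the rows returned depend on the MySQL or MariaDB version, and utf8mb4_unicode_520_ci will be missing on servers that predate it.

```sql
-- List the utf8mb4 collations whose names start with utf8mb4_unicode.
SHOW COLLATION
WHERE `Charset` = 'utf8mb4'
  AND `Collation` LIKE 'utf8mb4_unicode%';
```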

If you need case-insensitive comparison and sorting for regular letters along with emoji, then this problem is solved by using utf8mb4_unicode_520_ci:

SELECT
  WEIGHT_STRING('🌮' COLLATE 'utf8mb4_unicode_520_ci'),
  WEIGHT_STRING('🌶' COLLATE 'utf8mb4_unicode_520_ci')

You will then get different weights for those emoji: 0xfbc3f32e and 0xfbc3f336.
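
Applied to the second table from the question, the change could look like the sketch below. It assumes a server that provides utf8mb4_unicode_520_ci (MySQL 5.6 or later, according to the related question linked in the comments) and a connection that uses utf8mb4. Swap in utf8mb4_bin if case-sensitive matching is acceptable.

```sql
DROP TABLE IF EXISTS `emoji_test_with_unique_key`;

-- Same schema as in the question; only the collation of `string` changes.
CREATE TABLE `emoji_test_with_unique_key` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `string` varchar(191) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_520_ci NOT NULL DEFAULT '',
  `status` tinyint(1) NOT NULL DEFAULT '1',
  PRIMARY KEY (`id`),
  UNIQUE KEY `idx_string_status` (`string`,`status`) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

-- With the newer collation each emoji has its own weight, so none of
-- these rows should collide on the unique key.
INSERT INTO emoji_test_with_unique_key (`string`, `status`) VALUES
  ('🌶', 1), ('🌮', 1), ('🌮🌶', 1), ('🌶🌮', 1);

SELECT COUNT(*) FROM emoji_test_with_unique_key;
```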

t.niese
  • This is incredible. Switching encoding to `utf8mb4_bin` in the `CREATE TABLE`s above made the rest of the test queries work exactly as expected. Thanks so much. Any further insight into this would be appreciated. – Ryan Dec 14 '16 at 17:05
  • No wonder binary collation fixes the issue (that's what it's meant for) but I can't understand why would two entirely different emojis be considered as case variations of the same character. I doubt it's intentional. – Álvaro González Dec 14 '16 at 17:09
  • @ÁlvaroGonzález a similar reason why `Ä`, `Á` and `A` are the same, even if they might have different pronunciation and meaning. My first thought was that they are treated as equal because they are all in the category food, but it's more likely that the `ci` just checks if they are emoji. – t.niese Dec 14 '16 at 17:12
  • So... Collation database doesn't have information about them so they get assigned a generic common weight thus become "equal"? – Álvaro González Dec 14 '16 at 17:35
  • 8.0 will usher in `utf8mb4_0900_ai_ci`, based on UCA 9.0.0. – Rick James Dec 15 '16 at 06:06
2

You don't need to go to weights. Do something like this to see whether two characters (or strings) are equal:

mysql> SELECT '🌮' = '🌶' COLLATE utf8mb4_unicode_ci;
+--------------------------------------+
| '?' = '?' COLLATE utf8mb4_unicode_ci |
+--------------------------------------+
|                                    1 |  1 = true, hence equal
+--------------------------------------+
1 row in set (0.00 sec)

mysql> SELECT '🌮' = '🌶' COLLATE utf8mb4_unicode_520_ci;
+------------------------------------------+
| '?' = '?' COLLATE utf8mb4_unicode_520_ci |
+------------------------------------------+
|                                        0 |  unequal
+------------------------------------------+
1 row in set (0.00 sec)
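
The same COLLATE clause can also be attached to a single comparison against the existing table, which may be handy for checking the effect before changing the schema. This is a sketch that assumes the connection character set is utf8mb4 (set here with SET NAMES). Note that an explicit collation in the WHERE clause may prevent MySQL from using an index on the column.

```sql
SET NAMES utf8mb4;

-- Compare with a binary collation for this query only.
SELECT * FROM emoji_test WHERE `string` = '🌮' COLLATE utf8mb4_bin;

-- Or keep case-insensitive matching for letters, using the newer UCA weights.
SELECT * FROM emoji_test WHERE `string` = '🌮' COLLATE utf8mb4_unicode_520_ci;
```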
Rick James