Emojis as question marks in MariaDB primary key

Question

With the following table, using mariadb-server-10.1 10.1.32+maria-1~trusty:

CREATE TABLE `tags` (
  `tag_name` varchar(150) COLLATE utf8mb4_unicode_ci NOT NULL,
  `thing_id` int(11) NOT NULL,
  PRIMARY KEY (`thing_id`,`tag_name`),
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

Manifests as (in SQLAlchemy app):

IntegrityError: (IntegrityError) 
(1062, "Duplicate entry '1532-?' for key 'PRIMARY'")

at the second emoji insertion attempt.

The first one seems to be ok in the db (believe me, it shows right in my console and my browser as "alien" emoji):

> select tag_name, HEX(tag_name) from tags;
+----------+----------------+
| tag_name | HEX(tag_name)  |
+----------+----------------+
| GOODGUY  | 474F4F44475559 |
|         | F09F91BD       |
+----------+----------------+
2 rows in set (0.00 sec)

I am aware of Emoji's in mysql turns to question marks , but my.cnf has:

default-character-set = utf8mb4
character-set-server = utf8mb4
collation-server = utf8mb4_unicode_ci
default-character-set = utf8mb4

and utf8mb4 is being used in the client (charset added to connection string as ?charset=utf8mb4). And I believe primary key is checked on the server side.

Is there anything else I am missing (can check) or is it some MariaDB bug or some additional configuration required?

not sure whether relevant or not, but the same problem when inserting via mysql shell. Also tried this to see what is going on (found this in mysql utf8mb4_unicode_ci cause unique key collision ):

> SELECT ''='' COLLATE utf8mb4_unicode_ci;
+------------------------------------+
| '?'='?' COLLATE utf8mb4_unicode_ci |
+------------------------------------+
|                                  1 |
+------------------------------------+

but not sure whether this relevant.

I do not understand: database sometimes shows it alright (via both clients - shell and SQLAlchemy app), but then fails to show in the header? From the evidence I got, I do not understand where that bad conversion happens. The data seems to be ok in the database (see hex above), but two emojis equivalent for the primary key?

One more to contrast:

> SELECT ''='' COLLATE utf8mb4_bin;
+-----------------------------+
| '?'='?' COLLATE utf8mb4_bin |
+-----------------------------+
|                           0 |
+-----------------------------+

This kind of points finger primary key does not use binary? All emojis converted to something else before being used in the index? Quite weird.

See question marks in https://stackoverflow.com/questions/38363566/trouble-with-utf8-characters-what-i-see-is-not-what-i-stored — Rick James, Jul 20 '18 at 14:30
Possible duplicate of [Trouble with UTF-8 characters; what I see is not what I stored](https://stackoverflow.com/questions/38363566/trouble-with-utf-8-characters-what-i-see-is-not-what-i-stored) — Rick James, Jul 20 '18 at 14:30
@RickJames please, do not jump to conclusions so quickly. The question you refer to does not solve my problem. I am already using techniques described there. — Roman Susi, Jul 20 '18 at 15:40
Could it be that #1532 is already tagged with that Emoji? And you are `INSERTing` the same tag again? And the 'question mark' is actually coming from SQLAlchemy, not MySQL? If so, consider `INSERT IGNORE ...`. — Rick James, Jul 20 '18 at 18:37
Its the same via mysql shell. I found another collation-related answer, but it's not definite what to do. Will update answer in a minute — Roman Susi, Jul 20 '18 at 19:21

score 2 · Accepted Answer · answered Jul 20 '18 at 19:39

2

mysql> SELECT ''='' COLLATE utf8mb4_unicode_520_ci;
+----------------------------------------+
| '?'='?' COLLATE utf8mb4_unicode_520_ci |
+----------------------------------------+
|                                      0 |
+----------------------------------------+

Two things to note:

The 520 collation treats them as different.
The question marks are a display problem.

To further isolate that...

mysql> SELECT HEX('');
+----------+
| HEX('?') |
+----------+
| F09F9886 |
+----------+

My point is that the question mark may be a display problem, and (as you point out with the hex) it is not an encoding problem:

(1062, "Duplicate entry '1532-?' for key 'PRIMARY'")

And, if you want to tag 1532 with both Emoji, that column needs to be utf8mb4_unicode_520_ci (or utf8mb4_bin).

answered Jul 20 '18 at 19:39

Rick James

135,179
13
127
222

Yes, I have the same results as above. But what is that `utf8mb4_unicode_520_ci` ? Why should I choose it over my current choice? I see a lot of posts how to get rid of that "520". – Roman Susi Jul 20 '18 at 19:47
Ok. found description https://stackoverflow.com/questions/18904883/what-is-the-difference-between-utf8-unicode-ci-and-utf8-unicode-520-ci – Roman Susi Jul 20 '18 at 19:49
@RomanSusi - The 520 treats Emoji as distinct. That link is talking about Turkish. 520 does not deal as well with Language-specific requirements. – Rick James Jul 20 '18 at 19:49
https://stackoverflow.com/questions/42385099/1273-unknown-collation-utf8mb4-unicode-520-ci , plus more by googling - I have not read any further. Of course, there are similarly misguided advices to replace utf8mb4 with utf8 – Roman Susi Jul 20 '18 at 19:51
@RomanSusi - And the second link dealt with someone downgrading from 5.6 (which has 520) to 5.5. Again, a valid reason for avoiding 520, but not a reason that applies for you. – Rick James Jul 20 '18 at 19:57
Maybe I was not paying that much attention to versions there, but before switching to another collation at least I need to know what else will happen. The system uses more collations / character than emoji and ASCII. However, I must accept your answer. Thanks! – Roman Susi Jul 20 '18 at 20:02
It may be that 520 did not exist in your version 10.1.10? Or is that 10.1.32? – Rick James Jul 20 '18 at 20:06
It's 10.1.32 - it is there. – Roman Susi Jul 20 '18 at 20:29

Emojis as question marks in MariaDB primary key

1 Answers1