3

I'm trying to convert a database to use utf8mb4 instead of utf8. Everything is going fine except one table:

CREATE TABLE `search_terms` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `search_term` varchar(128) NOT NULL,
  `time_added` timestamp NULL DEFAULT NULL,
  `count` int(10) unsigned NOT NULL DEFAULT '0',
  PRIMARY KEY (`id`),
  UNIQUE KEY `search_term` (`search_term`),
  KEY `search_term_count` (`count`)
) ENGINE=InnoDB AUTO_INCREMENT=198981 DEFAULT CHARSET=utf8;

Basically all it does is save an entry every time somebody searches something in a form so we can track the number of searches, very simple.

There's a unique index on search_term because we want to only have one row per search term and instead increment the count value.

However when converting to utf8mb4 I am getting duplicate entry errors. Here is the command I am running:

ALTER TABLE `search_terms` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

Looking in the database I can see various examples like this:

fm2012

fm2012

fm2012

In it's current utf8 character set, these are all being treated as unique and exist within the database without ever having an issue with the unique index on search_term.

But when converting to utf8mb4 they are now being considered equal and throwing an error due to that index.

I can figure out how to merge these together easily enough, but i'm concerned this may be a symptom of a greater underlying problem. I'm not really sure how this has happened or what the consequences may be, so my questions are a bit vague:

  1. Why is utf8mb4 treating these differently to utf8?
  2. What are the possible consequences?
  3. Is there someway I can do a conversion so things like "fm2012" never appear in my database and I only have "fm2012" (I am also using Laravel 5.1)
Community
  • 1
  • 1
John Mellor
  • 2,351
  • 8
  • 45
  • 79
  • That looks more like UTF-32 and UTF-16 to be honest. It's odd the conversion does that. Does this happen on trivial test tables? – tadman Mar 21 '17 at 04:21
  • Could you please paste your error? – Nishant Nair Mar 21 '17 at 04:46
  • What's the collation before the conversion and are these really spaces in your example string "fm2012"? Can you get us the hex dump? – Beat Mar 21 '17 at 05:16
  • @NishantNair it's a simple "Duplicate entry 'fm2012' for key 'seach_term'" – John Mellor Mar 21 '17 at 05:31
  • @Beat it's "utf8_general_ci" and as far I can tell it's not spaces. It appears exactly as it appears in this SO question. I'm not sure what you mean by the hex dump – John Mellor Mar 21 '17 at 05:31

2 Answers2

3

Your problem is the change of collation: you're using general_ci and you're converting to unicode_ci: general_ci is quite a simple collation that doesn't know much about unicode, but unicode_ci does.

The first "f" in your example string is a "Fullwidth Latin Small Letter F" (U+FF46) which is considered equal to "Latin Small Letter F" (U+0066) by unicode_ci but not by general_ci.

Normally it's recommended to use unicode_ci exactly because of its unicode-awareness but you could convert to utf8mb4_general_ci to prevent this problem.

To prevent this problem in the future, you should normalize your input before saving it in the DB. Normally you'd use NFC, but your case seems to call for NFKC. This should bring all "equivalent" strings to the same form.

Beat
  • 1,337
  • 15
  • 30
  • Ah, thanks alot that makes sense. Are there any other differences in using utf8mb4_general_ci instead of utf8mb4_unicode_ci? It seems like a good opportunity to make the switch but I guess it could also cause more problems. – John Mellor Mar 21 '17 at 05:55
  • There are many differences, see http://stackoverflow.com/a/2344130/853468 – Beat Mar 21 '17 at 05:58
1

Despite what was said previously it is not about general_ci being more simplistic than unicode_ci. Yes, it can be true, but the issue is that you need to keep it matching to the sub-type you have.

For example, my database is utf8_bin. I cannot convert to utf8mb4_unicode_ci nor to utf8mb4_general_ci. These commands will throw an error of a duplicate key being found. However the correct collation utf8mb4_bin completes without issues.

TS'
  • 113
  • 1
  • 13