utf8_bin vs. utf_unicode_ci

Question

My table Website

Website_Name//column name
Google
Facebook
Twitter
Orkut
Frype
Skype
Yahoo
Wikipedia

I i use utf8_bin collation then my query to search wikipedia in Website is

Select Website_Name from Website where lower(Website_Name)='wikipedia'

And if i use utf8_unicode_ci then my select query to search wikipedia in Website is

Select Website_Name from Website where Website_Name='wikipedia'

Now I want to know which collation is best depending upon the following queries

score 79 · Accepted Answer · edited Feb 02 '15 at 23:45

79

It depends on what you need.

The utf8_bin collation compares strings based purely on their Unicode code point values. If all of the code points have the same values, then the strings are equal. However, this falls apart when you have strings with different composition for combining marks (composed vs. decomposed) or characters that are canonically equivalent but don't have the same code point value. In some cases, using utf8_bin will result in strings not matching when you expect them to. Theoretically, utf8_bin is the fastest because no Unicode normalization is applied to the strings, but it may not be what you want.

utf8_general_ci applies Unicode normalization using language-specific rules and compares strings case-insensitively. utf8_general_cs does the same, but compares strings case-sensitively.

edited Feb 02 '15 at 23:45

mikeytown2

1,744
24
37

answered Jun 07 '12 at 10:20

Delan Azabani

79,602
28
170
210

so what sould i use .be specific – Jun 07 '12 at 10:21
1

As I said, you should make this decision based on what you need. From what I can see about what you're trying to do, I'd go with `utf8_general_ci`, myself. – Delan Azabani Jun 07 '12 at 10:26
1

Is there any disadvantage in using lower() with utf8_bin – Jun 07 '12 at 10:28
2

While it is the same thing in English, using `lower()` is not always the same thing as comparing case-insensitively in some languages, and you may run into problems by using `lower()` to compare case-insensitively. I'd stick with using a case-insensitive collation. – Delan Azabani Jun 07 '12 at 10:31

score 15 · Answer 2 · answered Jun 07 '12 at 10:40

Personally I would go with utf8_unicode_ci, if you expect that lettercase is generally not important for the results you want to find.

Collations aren't only used at runtime, but also when MySQL builds indexes. So if any of these columns appear in an index, finding data according to the comparison rules of that collation will be pretty much as fast as it ever gets.

In those cases where you do not want case insensitive matching, then do not apply upper or lower. Instead, apply the BINARY keyword in front of the utf8 column to force a literal code-point comparison rather than one according to the collation.

mysql> create table utf8 (name varchar(24) charset utf8 collate utf8_general_ci, primary key (name));
Query OK, 0 rows affected (0.14 sec)

mysql> insert into utf8 values ('Roland');
Query OK, 1 row affected (0.00 sec)

mysql> insert into utf8 values ('roland');
ERROR 1062 (23000): Duplicate entry 'roland' for key 'PRIMARY'
mysql> select * from utf8 where name = 'roland';
+--------+
| name   |
+--------+
| Roland |
+--------+
1 row in set (0.00 sec)

mysql> select * from utf8 where binary name = 'roland';
Empty set (0.01 sec)

This should be much faster than using lower or upper, since in those cases, MySQL first needs to make a copy of the column value and modify its lettercase, and then apply the comparison. With BINARY in place it will simply use the index first to find matches, and then do a code-point by code-point comparison untill it finds the values are not equal, which will generally be faster.

Just a heads up from my experience; using `WHERE BINARY` or `COLLATE utf8_bin` have a negative performance impact on queries that use the PRIMARY KEY when the row is `utf8_general_ci`. Tested on MySQL 5.6.22 & 5.6.10. Issue didn't show up until the database was under decent load. — mikeytown2, Feb 02 '15 at 20:59

score 9 · Answer 3 · answered Feb 18 '16 at 13:24

9

I was using 'utf8_unicode_ci' which is default by doctrine , i had to change it to :

 * @ORM\Table(name = "Table", options={"collate"="utf8_bin"})

Since some of my composite primary keys consisted of text fields. Sadly 'utf8_unicode_ci' resolved "poistný" and "poistny" as same primary key value and ended with crash at doctrine inserting flush. I could not simply change collation of one part of composite primary key, had to drop table and recreate. Hope it saves time to someone else..

answered Feb 18 '16 at 13:24

Jiro Matchonson

875
1
17
24

1

This article [utf8_unicode_ci vs utf8_general_ci](https://forums.mysql.com/read.php?103,187048,188748) explained the differences between `utf8_unicode_ci` and `utf8_general_ci` – boris1993 Dec 12 '19 at 05:57

utf8_bin vs. utf_unicode_ci

3 Answers3

Linked