Personally I would go with utf8_unicode_ci
, if you expect that lettercase is generally not important for the results you want to find.
Collations aren't only used at runtime, but also when MySQL builds indexes. So if any of these columns appear in an index, finding data according to the comparison rules of that collation will be pretty much as fast as it ever gets.
In those cases where you do not want case insensitive matching, then do not apply upper or lower. Instead, apply the BINARY
keyword in front of the utf8 column to force a literal code-point comparison rather than one according to the collation.
mysql> create table utf8 (name varchar(24) charset utf8 collate utf8_general_ci, primary key (name));
Query OK, 0 rows affected (0.14 sec)
mysql> insert into utf8 values ('Roland');
Query OK, 1 row affected (0.00 sec)
mysql> insert into utf8 values ('roland');
ERROR 1062 (23000): Duplicate entry 'roland' for key 'PRIMARY'
mysql> select * from utf8 where name = 'roland';
+--------+
| name |
+--------+
| Roland |
+--------+
1 row in set (0.00 sec)
mysql> select * from utf8 where binary name = 'roland';
Empty set (0.01 sec)
This should be much faster than using lower or upper, since in those cases, MySQL first needs to make a copy of the column value and modify its lettercase, and then apply the comparison. With BINARY in place it will simply use the index first to find matches, and then do a code-point by code-point comparison untill it finds the values are not equal, which will generally be faster.