I could not find much information about the compatibility and performance consequences of selecting and comparing values from database columns with different collations.
As I am designing a new database, I am thinking about whether it would be a good idea to make some columns (or whole tables) in the database use a different collation to save space and improve performance. For example, using the utf8mb3
/utf8
(or even latin1
for ASCII/latin-only values) charset in tables where there is no need for storing full Unicode, like no emojis and so on. While using utf8mb4
in the rest of the cases. I am mostly concerned with the compatibility and performance in the case of comparisons between different charsets and collations.
So how do e.g. utf8
and utf8mb4
work when values are compared (e.g. in a WHERE
clause)? If I understand correctly, this should lead to an error.
Also, would there be any considerable performance issues when comparing values from columns with collations utf8mb4_unicode_ci
and utf8mb4_general_ci
or utf8mb4_unicode_520_ci
? What about just selecting from different tables (e.g. doing a JOIN
on two tables with different collations in each of their columns but having the exact same collation for the common column they are joined on)?
I ask these questions only in the context of MySQL and/or MariaDB. Any information/resources about those are very welcome.
EDIT 1: If I understood correctly from these question and answers, comparing utf8mb4_unicode_ci
and utf8mb4_general_ci
wouldn't work by default without explicitly specifying a collation to be used?
EDIT 2: I tested with MariaDB comparing two columns of different collations and got the exact same error: #1267 - Illegal mix of collations (utf8mb4_unicode_ci,IMPLICIT) and (utf8mb4_general_ci,IMPLICIT) for operation '='
.
However, comparing utf8mb4_general_ci
and utf8mb3_unicode_ci
works there, but why?
EDIT 3: I think the answer to the question of EDIT 2 is coercibility. In case someone else has a similar question or issue, I think this and this answer are very informative.
EDIT 4: Are there any cases where implicit conversion or explicit (via casting) would not cause indexes to be ignored?