0

I could not find much information about the compatibility and performance consequences of selecting and comparing values from database columns with different collations.

As I am designing a new database, I am thinking about whether it would be a good idea to make some columns (or whole tables) in the database use a different collation to save space and improve performance. For example, using the utf8mb3/utf8 (or even latin1 for ASCII/latin-only values) charset in tables where there is no need for storing full Unicode, like no emojis and so on. While using utf8mb4 in the rest of the cases. I am mostly concerned with the compatibility and performance in the case of comparisons between different charsets and collations.

So how do e.g. utf8 and utf8mb4 work when values are compared (e.g. in a WHERE clause)? If I understand correctly, this should lead to an error.

Also, would there be any considerable performance issues when comparing values from columns with collations utf8mb4_unicode_ci and utf8mb4_general_ci or utf8mb4_unicode_520_ci? What about just selecting from different tables (e.g. doing a JOIN on two tables with different collations in each of their columns but having the exact same collation for the common column they are joined on)?

I ask these questions only in the context of MySQL and/or MariaDB. Any information/resources about those are very welcome.

EDIT 1: If I understood correctly from these question and answers, comparing utf8mb4_unicode_ci and utf8mb4_general_ci wouldn't work by default without explicitly specifying a collation to be used?

EDIT 2: I tested with MariaDB comparing two columns of different collations and got the exact same error: #1267 - Illegal mix of collations (utf8mb4_unicode_ci,IMPLICIT) and (utf8mb4_general_ci,IMPLICIT) for operation '='. However, comparing utf8mb4_general_ci and utf8mb3_unicode_ci works there, but why?

EDIT 3: I think the answer to the question of EDIT 2 is coercibility. In case someone else has a similar question or issue, I think this and this answer are very informative.

EDIT 4: Are there any cases where implicit conversion or explicit (via casting) would not cause indexes to be ignored?

kataba
  • 93
  • 6
  • Please don't insert "EDIT"s/"UPDATE"s, just make your post the best presentation as of edit time. Please ask 1 specific researched non-duplicate question. [ask] [Help] – philipxy Aug 06 '22 at 03:17

1 Answers1

0
  • Comparing a column to a literal (quoted string)

This leads to the confusing list of cases that you have listed. Some will "coerce" and use an index; some will do a scan instead of using an index; some with spit at you. (I can't elaborate on which case is which.)

  • Comparing columns used for a JOIN

These must be the same collation, else INDEXes won't be used. I suspect there are some edge cases where coercion "should" work, but there could be edge cases that make it unwise to do so.

I'm thinking of a simple JOIN between two tables. The Optimizer would like to pick which table to start with. That would force the coercion to happen in one way versus the other. The effect might be inconsistent.

utf8mb4_general_ci versus utf8mb4_unicode_520_ci -- The former treats each character alone; the latter treats an accented letter as equal to the letter plus the accent (two "characters"). (There are a lot of other differences.)

And then comes the issue of "which of the two collations should rule?". Should the Turkish collation be used in preference to the Vietnamese?

You may know that the text has no Emoji, but MySQL doesn't. What should it do if the utf8mb4 side of the JOIN had an emoji? Equal and not-equal might work, but what about greater-than or ORDER BY?

Rick James
  • 135,179
  • 13
  • 127
  • 222