1

My MySQL server doesn't recognize the difference between characters 'æ' and 'ae' while storing data to database and that creates some problems for me. My goal is to find a charset which recognizes the difference between those characters, and I found it (utfmb3), but it is going to be deprecated, and the new alternative (utfmb4) doesn't recognize those characters as different.

What I've tried:

set names 'utf8mb3';
select 'æ' = 'ae';

This select returns 0 (false), which means this charset sees these as different characters, and that's just what I need, but MySQL gives me a warning: 'utf8mb3' is deprecated and will be removed in a future release. Please use utf8mb4 instead

But when I do

set names 'utf8mb4';
select 'æ' = 'ae';

This select returns 1, which means utf8mb4 sees these as the same characters, which is not good..

So, my dilema is, what charset to use? If I use utfmb3, it will be deprecated soon, that's no good. If I use utfmb4, that won't work correctly.

O. Jones
  • 103,626
  • 17
  • 118
  • 172

2 Answers2

2

= and LIKE comparisons in WHERE clauses apply a collation (not just a character set) to determine this kind of equality. This statement returns zero for the first two collations and one for the second two.

SELECT 'æ' = 'ae' COLLATE utf8mb4_unicode_ci,       -- 0
       'æ' = 'ae' COLLATE utf8mb4_general_ci,       -- 0
       'æ' = 'ae' COLLATE utf8mb4_unicode_520_ci,   -- 1
       'æ' = 'ae' COLLATE utf8mb4_german2_ci        -- 1

It seems likely your default collation is one of the last two or some other collation that handles that equality test the way you don't want it.

You can see your connection's collation setting with this statement. I suspect it is utf8mb4_unicode_520_ci.

SELECT @@collation_connection;

Be sure to define the collation for your columns with one you do want, and set your connection collation to the same thing. utf8mb4_unicode_ci is suitable. Try this.

SET collation_connection = 'utf8mb4_unicode_ci';
SELECT 'æ' = 'ae'   -- 0;

It's hard to give more specific advice without understanding your linguistic requirements better.

More info here: Difference between utf8mb4_unicode_ci and utf8mb4_unicode_520_ci collations in MariaDB/MySQL?

O. Jones
  • 103,626
  • 17
  • 118
  • 172
-1

Coalition 'utf8mb4_unicode_ci' is the current one you want to use. Make sure you're setting your client (ie php, node. python) to use the correct charset as well (both in the db client object and the environment config).

Mason Stedman
  • 613
  • 5
  • 12
  • So what you are saying is that for string comparison and sorting, one should use **charset** and not **collation**? – N.B. Mar 02 '22 at 12:37
  • 2
    Coalition and Charset are different things. A character set is a set of symbols and encodings. A collation is a set of rules for comparing characters in a character set. You need extended Unicode characters, that has little to do with coalition. – Mason Stedman Mar 02 '22 at 12:38
  • Thank you for explaining it to me, it wasn't obvious from my comment that I knew what you wrote. Much appreciated, have a great day! :) Edit: Coalition is a temporary alliance, I'm not sure where it's set in MySQL or any other DB> – N.B. Mar 02 '22 at 12:39
  • To clarify, you still want to make sure you're setting the correct coalition for the charsets you're using especially if they're disparate charsets. utf8mb4_unicode_ci includes most common language stuff, if you need to store crap like emojis you'll want bin. – Mason Stedman Mar 02 '22 at 12:41
  • Also feel free to mark correct if that solves it! – Mason Stedman Mar 02 '22 at 12:45
  • Ok, getting serious because you're not a bad person - when answering, it's the obligation of you - the expert and authority (since you are capable of answering) to clarify the solution and help the one asking know what's used for what. That way, you solidify what you know and you help the others that will eventually arrive here via Google. I appreciate the exchange between us, as you can tell - I can google too. The way I improve answers is by steering people, not by editing answers and filling in the gaps. You fill in the gap and beat my sarcastic comment with knowledge and professionalism. – N.B. Mar 02 '22 at 12:47
  • @N.B. coalition is a set of rules for interpretation. Sorting \ Indexing is literally 1's and 0's in order. SQL doesn't "interpret" sorting. SQL interprets incoming input, then translates it into it's table\logic schema based on the rules you set. æ takes more bytes than the the current charset in the above example, so it splits it into 2 chars when saving it. Switching to the correct charset \ coalition combo will fix that. – Mason Stedman Mar 02 '22 at 13:19
  • 1
    It is not co**a**lition, it is co**ll**ation. It seems there's a language barrier present, I'll excuse myself from further communication. Your answer is wrong, sadly, and thus the downvote. Please take a look at O. Jones's answer to see why. – N.B. Mar 02 '22 at 13:37
  • I'm typing on my phone, so pardon the autocorrect errors. His answer is literally my answer with some expansion. I said to use -collation- 'utf8mb4_unicode_ci'. His answer says that's correct. I asked for further clarification if that didn't work because it could be client related.... but it would have worked. – Mason Stedman Mar 02 '22 at 13:42
  • It isn't. Let's stop. – N.B. Mar 02 '22 at 13:44
  • 'æ' = 'ae' COLLATE utf8mb4_general_ci, -- 0 Which is literally what he asked for. Straight pasted from the answer you say is "correct". His answer definitely explains it better. Yet I'm still not wrong. – Mason Stedman Mar 02 '22 at 13:44
  • Hell the "correct" answer is -literally- what I said to do: SET collation_connection = 'utf8mb4_unicode_ci'; SELECT 'æ' = 'ae' -- 0; – Mason Stedman Mar 02 '22 at 13:50
  • The answers aren't the same. I can type the letters on this site but I can't understand it for you. You're illiterate and you refuse to read, there's no possibility of conversation here. Instead of bothering me in comments and highlighting me, you could have improved your answer, if helping was what you were interested in, instead of collecting reputation points. Also, no means no so when someone asks you to stop - please do. – N.B. Mar 02 '22 at 14:11
  • The conclusion of his answer is -literally- the same as what I said to do. His answer is 100% more in depth, but you said I was factually incorrect. Not even being sarcastic, I like to learn too. What did I say that was actually incorrect so I don't repeat the error. Isn't it on you to assist since you posted? – Mason Stedman Mar 02 '22 at 14:16
  • for `utf8_unicode_ci`, "az" < "æ" < "b". More details: mysql.rjweb.org/utf8_collations.html – Rick James Mar 04 '22 at 05:13