1

I have a user table in MySQL 5.7.27 with utf8mb4_unicode_ci collation.

Unfortunately, ı is not threaded as i for example, the below query won't find Yılmaz

select id from users where name='Yilmaz';

I do not have the problem with other umlautes like ä and a. For example the two queries give the exact same result.

select id from users where name='Märie';

select id from users where name='Marie';

I cannot simply replace ı by i and to the search, because then I would not find users with the name Yılmaz.

Do I have to use different collation to support all umlaute?

Here are some more information about the unicode letters:

code    | glyph |decimal |  html   | description
U+0131  |  ı    |305     |ı |  Latin Small Letter dotless I
U+0069  |  i    |105     |-        |  Latin Small Letter I
Adam
  • 25,960
  • 22
  • 158
  • 247
  • It's difficult to tell just by inspection what those letters are. Could you add the Unicode code points and their names to your question – JGNI Dec 06 '19 at 10:47
  • https://stackoverflow.com/a/33630106/476 – deceze Dec 06 '19 at 10:56
  • @deceze thank you, but I guess if I use Turkish collation, then I will get into trouble with other umlautes? I need to support umlautes from all countries. – Adam Dec 06 '19 at 11:02
  • @JGNI I have attached a detailed description of the unicode umlautes in the question. – Adam Dec 06 '19 at 11:03
  • maybe something like this? https://stackoverflow.com/a/11761321/916000 – Taha Paksu Dec 06 '19 at 11:22
  • @TahaPaksu I tried it, but it didn't work either. I have character set `utf8mb4` so I had to use `utf8mb4_bin`, but `bin` is the most restrictive collocation anyway. – Adam Dec 06 '19 at 11:44
  • did you try the `LIKE` operator only? – Taha Paksu Dec 06 '19 at 11:52
  • @TahaPaksu I tried `SELECT * FROM 'users' WHERE name LIKE ('%Yilmaz%' COLLATE utf8mb4_bin)` – Adam Dec 06 '19 at 11:59
  • But I mean without the COLLATE keyword? The string in the cell is already utf8mb4_bin collated, and when you convert the input collation to utf8mb4_bin, it'll look for an exact match. If you don't specify the collation, maybe it'll try to get all variants to match. – Taha Paksu Dec 06 '19 at 12:02
  • 1
    Ok, new info, you can query with a different collation for that specific query like mentioned here: https://stackoverflow.com/a/2607164/916000. So can you try : `select id from users where name='Yilmaz' COLLATE utf8_general_ci;` – Taha Paksu Dec 06 '19 at 12:05
  • @Taha so only `LIKE` operator did not work either. I tried your last comment and to my surprise, `SELECT * FROM users WHERE name = 'Yilmaz' COLLATE utf8mb4_general_ci` actually works!! But I am highly confused, because the column `name`has collation set as `utf8mb4_general_ci` – Adam Dec 06 '19 at 12:08
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/203750/discussion-between-adam-and-taha-paksu). – Adam Dec 06 '19 at 12:12

1 Answers1

3

Referring to http://mysql.rjweb.org/utf8_collations.html , I see that ı=i in 3 collations: utf8_general_ci, utf8_general_mysql500_ci, utf8_turkish_ci. However, for the turkish collation, I=ı sorts before other accented I's. In all other collations ı sorts after all I's, as if it is treated as a separate letter.

Meanwhile İ=I in all collations except utf8_turkish_ci.

The plot thickens with MySQL 8.0. utf8mb4_tr_0900_ai_ci (only) has this ordering:

I=Ì=Í=Î=Ï=Ĩ=Ī=Ĭ=Į=ı sort before  i=ì=í=î=ï=ĩ=ī=ĭ=į=İ

Meanwhile ä=Ä and they match most other accented A's for most collations (including the Turkish ones).

Bottom line: It seems that utf8[mb4]_general_ci is the only collation in 5.7 or 8.0 that will always treat a dotless-i (or dotted-I) equal to a 'regular i/I and at the same time ignore umlauts.

Caveat: The "general" collations do not test more than one character at a time. That is, a "non-spacing umlaut" plus a vowel will not be treated as equal to the combination.

In that link... The one character æ is sorted the same as the two letters ae for some collations. That's indicated by: Aa ae=æ az. In about half of the other collations, the character æ is treated as a separate letter; this is indicated by it being after az and before b. Or even after zz for Scandinavian collations. This separate letter concept sometimes applies to letter pairs, for example cs (Hungarian) and ch (traditional Spanish).

Rick James
  • 135,179
  • 13
  • 127
  • 222
  • That is perfect, this page is exactly what I was looking for. So in short, there is unfortunately no collation that rules them all. Maybe I need to do something like this: `select * from .. where (a = b COLLATE utf8_general_ci )or where a=b;`? – Adam Dec 06 '19 at 17:21
  • Why are there multiple columns in the table? What means `ae`, `az` ,`bz`etc? – Adam Dec 06 '19 at 17:27
  • @Adam - I added to my answer to explain ae, az, etc. – Rick James Dec 06 '19 at 22:27
  • @Adam - and... True, no single collation. (But the 4 I's in Turkish is a bit tricky.) Using an `OR` with two different collations will lead to a full table scan; that is, no index can be used. – Rick James Dec 06 '19 at 22:37