What CHARSET/COLLATE I should use to read/find exact results from non-English characters

Question

CREATE TABLE `post` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `title` varchar(750) COLLATE utf8mb4_unicode_ci NOT NULL,
  `slug` varchar(750) COLLATE utf8mb4_unicode_ci NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `post_slug_unique` (`slug`),
) ENGINE=InnoDB AUTO_INCREMENT=317940 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci

What CHARSET/COLLATE I should use to store/search non-English characters instead of utf8mb4_unicode_ci in mysql?

This answer had satisfactory explanation and reasoning to use What's the difference between utf8_general_ci and utf8_unicode_ci?

But I am unable to know and fix, why following simple Query: q1 giving my undesired result

 select slug from post where slug like '%آ%' order by id limit 1;

Result: گورنر-چودھری-محمد-سرور-کی-چودھری-برادران-سے-ملاقات-سیاسی-اور-حکومتی-اُمور-پر-گفتگو

I expected the result to be empty (no rows) with q1 and above result should only be obtained by following query:q2

 select slug from post where slug like '%اُ%' order by id limit 1;

Diagnosis: I tried a hit and trial diagnosis and found following

select ASCII('آ'); => 216
Also
select ASCII('اُ'); => 216

So here I am stuck , apparently they are not same characters but giving me same character code. So Mysql is giving the results rightfully, but what different charset I can use to store my data so I can I get my desired result (Empty => no rows) from given Query:q1.

If these characters have the same character code in your data, then there is nothing you can do as in the end, character codes are used to determine if two characters are the same. You need to find a way for your application to generate a different character code for the two different characters. — Shadow, Jan 31 '22 at 08:20
Btw, the two do not have the same character code. The first one is https://www.compart.com/en/unicode/U+0622, the second one is https://www.compart.com/en/unicode/U+0627. Since the letters are Arabic, you probably need an Arabic collation for utf8mb4. show collations statement will tell you what collations are supported by your mysql and pick the right one. — Shadow, Jan 31 '22 at 08:30
What does "exact results" mean in natural language context? Would you like that "Book" will not find "book"? Or "café" will not find "cafe"? Or that acute accent unicode symbol followed by lower case a is considered different than single character `á`? — Álvaro González, Jan 31 '22 at 08:32
@Shadow, sorry it was wrong copy, just corrected the character and checked with mysql now => select ASCII('اُ'); and select ASCII('آ) both give 216; — Sami, Jan 31 '22 at 08:39
@Álvaro González no sir, exact means, exact match, I mean only q2 => `select slug from post where slug like '%اُ%' order by id limit 1;` (question edited) should give me the shown result anf q1 should give no rows — Sami, Jan 31 '22 at 08:48
I thought utf8_unicode_ci works well with all languages and it did for me as well, until it came to some specific characters, so as @Shadow guided I tried to find the right choice, but could not find yet. however I found https://stackoverflow.com/questions/500826/how-to-conduct-an-accent-sensitive-search-in-mysql#answer-501151 `WHERE col_name = 'val' collate utf8_bin` which is giving me exact match, though I am not yet willing to have case sensitive search, but found a way at least to not being stuck — Sami, Jan 31 '22 at 08:58
@Sami ascii() function just returns the first byte of a multi-byte character code, so it is only useful for latin characters and punctuation. The general collations will not always get the nuances of specific languages right. — Shadow, Jan 31 '22 at 09:15

Rick James · Accepted Answer · 2022-01-31T18:16:31.143

Don't use ASCII, use HEX

SELECT HEX('آ'):  D8A7D993
SELECT HEX('اُ'):  D8A7D98F

Further analysis shows that each "character" is really 2 characters, the first being "ALEF" (hex D8A7) and the second being a different "Non-spacing mark".

Unicode has specified (correctly or incorrectly) whether the accented Alefs should compare equal. And similarly for any other "marks" that can be applied to Alef.

Unicode is evolving. MySQL 8.0 has implemented Unicode version 9.0.0 in each of the utf8mb4_% collations (except utf8mb4_bin).

If you are not using 8.0, then you are stuck with utf8mb4_unicode_520_ci, which implements the older Unicode 5.2.0. You mentioned utf8mb4_unicode_ci, which is based on the even older Unicode 4.0.0. You should test with newer collations.

(What I say for character set utf8mb4 and collations utf8mb4% applies to utf8, since you seem to be using such.)

I am not an expert of either Arabic or Unicode, but perhaps this gives you some clues of where to look next.

Meanwhile, do you have a complete list of what Alef+mark 'characters' should or should not compare equal?

Note also that MySQL made significant changes in REGEXP with 8.0; before that, REGEXP was quite lame with it came to character sets. Meanwhile LIKE works differently from REGEXP. So you might try the equivalent REGEXP 'اُ' (no wildcard needed in this case).

One more thing, pre-8.0 mostly had _ci collations, which ignored case and accents (marks) when comparing. 8.0 introduces some _as (Accent Sensitive) collations. That would, I expect, treat the various Alefs as different.

What CHARSET/COLLATE I should use to read/find exact results from non-English characters

1 Answers1