0

I have an issue with unicode chars in an utf8mb4_general_ci table

SELECT * FROM `t1` WHERE c1='musca' 

returns

musca
muşca
muşcă

What I would like to know is if this is a bug - sounds like it;

and if it affects searching - it might, or better said it should; I can't make the column unique index

Anything I should do so mysql would consider a and ă and s and ş as different entities? (probably a and â, t and ţ, and i and î as well, but I haven't checked).

Should I store unicode chars as &#226 &#259 &#351 &#355 &#238 ?

I will need to retrieve the exact match of the user input.

Edited to add:

The answer is in the comments: I should collate the columns as utf8mb4_0900_as_cs as Madhur Bhaiya explained and demonstrated

flish
  • 596
  • 1
  • 6
  • 17
  • Possible duplicate of [Detecting utf8 broken characters in MySQL](https://stackoverflow.com/questions/1476356/detecting-utf8-broken-characters-in-mysql) – Sonal Borkar Nov 28 '18 at 10:58
  • There are no broken chars though. Everything looks fine, except that if I try to make the column unique it's not possible, since per my example a = ă and s = ş – flish Nov 28 '18 at 11:01
  • @flish What specific language do these characters belong to ? MySQL has added a lot of language specific collation recently. – Madhur Bhaiya Nov 28 '18 at 11:09
  • 1
    @flish check this blog from MySQL team: https://mysqlserverteam.com/mysql-8-0-1-accent-and-case-sensitive-collations-for-utf8mb4/ – Madhur Bhaiya Nov 28 '18 at 11:10
  • 1
    @flish Check this: https://www.db-fiddle.com/f/rV3EgJguKG97kXF28sFhmf/0 I used `utf8mb4_0900_as_cs` collation. – Madhur Bhaiya Nov 28 '18 at 11:12
  • Thank you. I'll start on my reading. So is this question a duplicate? Should I click on the button and close it? EDITED to add: Also, these chars belong to Romanian, but I will need to add French and German as well and I thought general_ci would work for all. – flish Nov 28 '18 at 11:18
  • @flish The _character set_ utf8 or utf8mb4 will work for all European languages. A character set refers to the _encoding. The longer things mentioned are _collations_; they refer to how to order the characters, such as wanting `a < ă` instead of `a = ă`. – Rick James Nov 29 '18 at 00:36

1 Answers1

1

You need COLLATION utf8_romanian_ci (or utf8mb4_romanian_ci) on the table columns in question. It is the only collation that treats those 5 characters as a separate 'letter'.

Reference: http://mysql.rjweb.org/utf8_collations.html

That is available in most versions of MySQL/MariaDB. There is no need for utf8mb4_0900_as_cs, which implies MySQL 8.0.

Rick James
  • 135,179
  • 13
  • 127
  • 222
  • It works, yes and I've accepted it as the answer. I will still need to install MySQL 8 because the same table will also contain French and German characters and it won't work. I just checked é and e – flish Nov 29 '18 at 15:44
  • @flish - I don't see the issue with `é`. It collates equal to `e` in all collations except Icelandic for both MySQL 5 and 8. I'm pretty sure the rest of French an German characters also work the same between 5 and 8. Please elaborate. – Rick James Nov 29 '18 at 20:37