0

Related question: mysql select distinct letters, including extended latin characters

Nobody bothered to reply about the final issue, so I'm creating a new question.

Using this table data: http://pastebin.com/cH2DUzf3

and executing the following query on it:

SELECT DISTINCT BINARY `letter` FROM `texts` ORDER BY `letter` ASC

yields almost perfect results, with one exception: the letter ū is before u. Can anyone explain that? Logically, the letter u is the basic version of ū and the order should be u, ū. Why is it not so? All the other letters are ordered correctly.

Second problem - executing the following statement on the table:

DELETE FROM `texts` WHERE `letter` = 'ū'

deletes also all texts where the letter is u. Why?

Same problem with querying data with Latvian letters:

SELECT * FROM `texts` WHERE `letter` = 'ā'

returns results where letter is a, not ā.

Edit: ok, the last 2 problems can also be solved by putting BINARY before letter.

Community
  • 1
  • 1
jurchiks
  • 1,354
  • 4
  • 25
  • 55
  • What is the character code of the two mentioned characters? – Lajos Arpad Jun 29 '14 at 15:55
  • I don't know. Where can I check that? – jurchiks Jun 29 '14 at 15:56
  • Which **unicode normalization form** is your data in? What sort-order and character sets are set for the field `letter`? – Deduplicator Jun 29 '14 at 15:57
  • @Deduplicator - what is `unicode normalization form`??? Also, how about checking the pastebin? – jurchiks Jun 29 '14 at 15:58
  • 2
    Please read: http://stackoverflow.com/questions/1036454/what-are-the-diffrences-between-utf8-general-ci-and-utf8-unicode-ci – Deduplicator Jun 29 '14 at 16:00
  • @Deduplicator - well, I had initially used unicode_ci, but it sorted all latvian letters after basic latin ones, i.e. `a,b,c,d,...z,ā,č,ē...`. That is totally not what I want. – jurchiks Jun 29 '14 at 16:03
  • @ÁlvaroG.Vicario - what's your point? You haven't said anything new. As you can see, I have tried many collations, and none work exactly like expected. – jurchiks Jun 29 '14 at 17:08
  • @ÁlvaroG.Vicario - did you not read the part where I said I have tried many collations and none of them work on my data? According to http://collation-charts.org/mysql60/mysql604.utf8_unicode_ci.european.html, `u` does come before `ū`, but my data shows otherwise, and I want to know why. Your "theory" does not help the least bit. – jurchiks Jun 29 '14 at 17:20
  • @ÁlvaroG.Vicario - read the first two lines of my question... How about you just leave this page and never come back? You haven't added absolutely anything of value, only downvoted my question and voted to close it. Edit: and now you even deleted your own comments to hide your tracks. Fantastic. – jurchiks Jun 29 '14 at 17:28
  • You got mad at me when I tried to correct wrong information stated on the question because I was only using the information provided in the question. So: 1) I removed the comments that offended you because my only intention was to help 2) I've voted as dupe because even you clearly insist that it's only a dupe 3) I've downvoted because it's an intentional dupe that doesn't even follow the site guidelines about self-contained question and for your rude manners. — Does this answer your questions or you need more clarifications? – Álvaro González Jun 29 '14 at 18:26
  • 1) What exactly is that "wrong information" that you claim to have tried to correct? 2) It is not a dupe, it is a question that arose when the initial problem was solved. Clearly adding a new question to an already existing one never results in answers, so I created a new question. 3) Please clarify, why are you still here? You still haven't helped, quite the contrary. – jurchiks Jun 29 '14 at 18:45

1 Answers1

1

You are using utf8_general_ci as your collation. It uses a fast but inaccurate comparison algorithm, which simply ignores accents over letters. It works okay-ish for English, but fails for most other languages.

Use something like utf8_latvian_ci or utf8_unicode_ci instead.

utf8_unicode_ci uses Unicode collation algorithm. It's also the slowest.

utf8_latvian_ci uses simplified collation algorithm focused on supporting Latvian alphabet: http://collation-charts.org/mysql60/mysql604.utf8_latvian_ci.html

EDIT: Okay, I was wrong. There is no built-in collation that would make MySQL sort accented letters after their unaccented variants while also preserving alphabetical order.

You can write your own collation, using the guide here: http://dev.mysql.com/doc/refman/5.0/en/ldml-collation-example.html

Karol S
  • 9,028
  • 2
  • 32
  • 45
  • Well, that's disappointing. But the thing is, I don't just have English and Latvian characters, I also have Russian ones, and there could appear more in the future (for example, Lithuanian/Estonian). Using `utf8_latvian_ci` just feels wrong. – jurchiks Jun 29 '14 at 16:05
  • Russian will work in all variants, including utf8_general_ci. I guess utf8_unicode_ci will be the safest bet then. I'll expand my answer. Note that in Lithuanian `y` comes between `į` and `j`, not `x` and `z`. – Karol S Jun 29 '14 at 16:08
  • Changed to `utf8_unicode_ci`, deleted/created the table, generated data anew - still `ū` before `u`. – jurchiks Jun 29 '14 at 16:12
  • According to http://collation-charts.org/mysql60/mysql604.utf8_unicode_ci.european.html, `u` (0055) comes before `ū` (016A). What could possibly explain this behavior? – jurchiks Jun 29 '14 at 16:18
  • It's in the same row, so it's considered equal. – Karol S Jun 29 '14 at 16:26
  • That doesn't make any sense, it should order them not just by the row but also by the character code... – jurchiks Jun 29 '14 at 16:41
  • ok, this is weird, if I use manually written test data (http://pastebin.com/CC1tGgUh), the `SELECT DISTINCT BINARY letter FROM texts` query returns the results ordered correctly, even though I'm using `utf8_unicode_ci`. What could that mean? – jurchiks Jun 29 '14 at 17:13
  • @jurchiks: I think the chart explains the Unicode language-neutral algorithm. Proper sorting requires a language-specific algorithm. – Nisse Engström Jun 29 '14 at 17:15
  • @NisseEngström - there can be no specific language, there are multiple languages in the table. – jurchiks Jun 29 '14 at 17:16
  • @jurchiks: Tough luck. – Nisse Engström Jun 29 '14 at 17:28