2

Both utf8mb4_unicode_ci and utf8mb4_danish_ci (mysql/mariadb database) seems to work fine on my danish website.

When I test for characters/symbols/emojis, they both seem to display the same result.
Danish letters like æ/ø/å works in both.

Why would I choose one charset/collation over the other?

Does utf8mb4_danish_ci add characters/codes or is it actually more narrow than utf8mb4_unicode_ci?

CKE
  • 1,533
  • 19
  • 18
  • 29
mowgli
  • 2,796
  • 3
  • 31
  • 68
  • 2
    The charset is `utf8mb4`. `utf8mb4_danish_ci` is the collation which "is a set of rules for comparing characters in a character set". – user3783243 Jul 18 '18 at 03:52
  • So why choose utf8mb4_unicode_ci over utf8mb4_danish_ci or reversed? – mowgli Jul 18 '18 at 04:04
  • It is for comparison, not storage. `characters/symbols/emojis` will not be an issue, nor `characters/codes`. They both support the same character set. https://stackoverflow.com/questions/367711/what-is-the-best-collation-to-use-for-mysql-with-php – user3783243 Jul 18 '18 at 04:12
  • Ok. I doubt I will be comparing utf8 characters/codes, only iso characters. The utf8mb4 is just to give users the ability to use emoji's etc. in a text field on my website. So I guess it's more a matter of performance? (in my case) – mowgli Jul 18 '18 at 04:20
  • Yup, or my understanding of it at least. I'll leave it to someone with more experience in case there are nuisances I'm missing though. – user3783243 Jul 18 '18 at 04:25
  • part of SHOW COLLATION WHERE Charset = 'utf8mb4'; result:utf8mb4_danish_ci | utf8mb4 | 235 | | Yes | 8 | PAD SPACE | utf8mb4_unicode_ci | utf8mb4 | 224 | | Yes | 8 | PAD SPACE | they are same – yaoweijq Jul 18 '18 at 08:55
  • An example I like is Spanish: `ch` used to be a letter of its own so dictionaries looked like `canada corea china`. The rule changed and it was split into c and h so dictionaries were updated to `canada china corea`. Same language, same characters, same encoding but two different collations to choose from ("traditional" and "modern"). – Álvaro González Jul 18 '18 at 18:50
  • @yaoweijq - The meat is behind the column with `235` vs `224`; the collations _are_ different. And especially different in the 3 'characters' the OP mentioned. – Rick James Jul 19 '18 at 15:42
  • @mowgli - "Encoding" is one thing; "Collation" is another. `ch` is _encoded_ the same in all `CHARACTER SETs`, but sorts different in `utf8_spanish2_ci` (and czech, lithuanian, and slovak collations). In utf8_lithuanian_ci, `c=ch`. More: http://mysql.rjweb.org/utf8_collations.html – Rick James Jul 19 '18 at 15:49
  • (For this European languages, character sets `utf8` and `utf8mb4` are identical, except for name.) – Rick James Jul 19 '18 at 15:51
  • @RickJames 235 vs 224 is only the identifier, i think actually they are the same, as other columns values are all the same. – yaoweijq Jul 20 '18 at 06:12
  • @yaoweijq - But all the collation information is behind that identifier, in some file or code somewhere. The _encoding_ is the same (both are utf8mb4); the _collation_ is different. – Rick James Jul 20 '18 at 14:18

1 Answers1

2

To distinguish and sort Emoji, I think you need utf8mb4_unicode_520_ci, which is based on Unicode 5.20. utf8mb4_unicode_ci is based on the older 4.0. (I don't know _danish_ compares Emoji.) MySQL 8.0 will introduce the 9.0 version: utf8mb4_0900_ai_ci.

For utf8mb4_danish_ci: æ/ø/å each sort after z. That is y < z < æ < ø < å

For utf8mb4_unicode_ci, æ acts like a letter between a and b'; ø between o and p; å=a

For utf8mb4_unicode_520_ci and utf8mb4_0900_ai_ci: æ=ae; ø=o; å=a

It all depends on whether you care about how those three characters compare (WHERE) and sort (ORDER BY) as above.

More details and more collations: http://mysql.rjweb.org/utf8mb4_collations.html

You did not ask, but it seems that danish says Ä=Æ=ä=æ < Ö=Ø=ö=ø < Aa=Å=å < Þ=þ; those seem to cater toward Icelandic and Swedish.

Rick James
  • 135,179
  • 13
  • 127
  • 222