9

I'm confused about the MySQL Collations and its Encodings, People usualy uses one of these three collations:

  1. utf8mb_bin
  2. utf8mb4_general_ci
  3. utf8mb4_unicode_520_ci

What I don't understand is when to use each of these collations when needed, Like for example,

A table for names like this

[id - name]

It would only has names characters from different languages like french, german, latin. . .

Do I use for such a table the collation of utf8mb_bin or stick with utf8mb4_unicode_520_ci,

In other hand, A table for topics of a blog for example:

[id - title - subject]

Do I put all the columns collation to utf8mb4_unicode_520_ci or use:

utf8mb_bin for title

utf8mb4_unicode_520_ci for subject

Since as I understood, the utf8mb4_unicode_520_ci has some emotes in it that would be used in blogs subjects, Or do I just ignore everything and just use utf8mb4_unicode_520_ci on all?

But overall, What is the point of using these different collations? And How does they affect my results in SELECT queries?

What I would like to know in berif is:

What collation to be used for each of:

  1. names
  2. titles
  3. subjects
  4. emails
  5. bios
  6. messages
  7. usernames
Toleo
  • 764
  • 1
  • 5
  • 19
  • Duplicate of https://stackoverflow.com/questions/2344118/utf-8-general-bin-unicode -- which talks about `utf8`; however, `utf8mb4` and `utf8` have the same characteristics relative to this Question. – Rick James Jul 19 '18 at 21:35

1 Answers1

15

You're confusing encoding and collation.

The available characters are defined by the encoding (and only the encoding). Since UTF-8 is a Unicode-compatible encoding, you have all characters. The MySQL peculiarity is that its utf8 encoding does not really implement UTF-8 but only a subset because it allocates 3 bytes per character and (as of today) some characters need 4 bytes. Thus utf8mb4 was born.

Collation is a set of rules that tell you how WHERE foo = bar and ORDER BY foo work. You need to ask yourself: if I search for internet, should it match Internet? If you store French, German and Latin words you most likely don't want a binary collation. Ideally you want one with the exact rules of the language you'll be using but, since you're mixing languages, you'll have to opt for a generic collation. You can make an informed decision after reading Difference between utf8mb4_unicode_ci and utf8mb4_unicode_520_ci collations.

Álvaro González
  • 142,137
  • 41
  • 261
  • 360
  • 1
    So you mean that If I want accuracy in search then `Binary Collation` is the best to be used, Yet If It is for something I want it to be searchable and easy to reach as for `í` to be looked at as `i`, Then `unicode_ci` is better for that, Since it is case insenitve? – Toleo Jul 20 '18 at 01:38
  • I'm not saying that at all. Binary means that `e` is totally different from `E` or `é` and the `é` character is different from `e` combined with `´`. Is that more accurate? It depends on what you need. – Álvaro González Jul 20 '18 at 06:52
  • 2
    Then what is the different between these collation in usage exactly? Why would & **when** someone use `Binary` over `unicode_ci` and the vice-versa? – Toleo Jul 20 '18 at 16:44
  • 3
    Yes e.g. you would use _bin on a column which holds email addresses. With a unique index on those email addresses you would assure that its possible to allow "müller@example.com" and "muller@example.com" or "rené@example.com" and "rene@example.com" as valid unique entries. And yes if you would like to implement a search it would be benefitting to use unicode_ci on a column which just will be search to return "rené" and "renè" if you just search "rene". – liqSTAR Mar 18 '22 at 11:04