25

I'm currently developing a website that will display content in almost any language in the world, and I'm having trouble choosing the best collation to define in MySQL.

Which one is the best to support all characters? Or the most accurate?

Or is it best just to convert all characters to Unicode?

Jason Aller
Pedro Luz

5 Answers

34

The accepted answer is wrong (maybe it was right in 2009).

utf8mb4_unicode_ci is the best collation to use for wide language support.

Reasoning and supporting evidence:

You want to use utf8mb4 rather than utf8 because the latter only supports characters of up to 3 bytes, and you want to support 4-byte characters (emoji, for example) as well. (ref)

and

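You want to use unicode rather than general because the latter uses simplified comparison rules instead of the Unicode Collation Algorithm, so it does not sort correctly for every language. (ref)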
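To make the 3-byte versus 4-byte point concrete, here is a minimal Python sketch. The sample characters and the helper names `utf8_length` and `fits_in_utf8mb3` are made up for illustration and are not part of MySQL. It only shows how many bytes UTF-8 needs per character, and which strings would not fit in a character set limited to 3 bytes per character.

```python
# Sample characters chosen for illustration (not taken from the answer).
# Escapes are used so the code points are unambiguous.
SAMPLES = ["a", "\u00e9", "\u20ac", "\U0001F600"]  # a, e-acute, euro sign, emoji


def utf8_length(ch: str) -> int:
    """Return how many bytes the character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


def fits_in_utf8mb3(text: str) -> bool:
    """True if every code point is at most U+FFFF (the Basic Multilingual
    Plane), which is the range that 3 bytes per character can represent."""
    return all(ord(ch) <= 0xFFFF for ch in text)


for ch in SAMPLES:
    print(f"U+{ord(ch):04X} needs {utf8_length(ch)} byte(s) in UTF-8")
```

Any string for which `fits_in_utf8mb3` returns `False` contains at least one character that needs 4 bytes, which is the case utf8mb4 exists to handle.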

Gerbus
    Thanks! But what is the disadvantage of doing this by default for every database/table? Does it use more space, or will it make my queries/searches less efficient compared to the default MySQL setting (latin1, I guess)? – supersan Apr 29 '20 at 13:56
23

I generally use UTF-8 (8-bit UCS/Unicode Transformation Format), which works perfectly for any (well, most) languages:

utf8_general_ci

http://dev.mysql.com/doc/refman/5.0/en/charset-unicode.html

stone
    I'd like to suggest using utf8_unicode_ci instead of utf8_general_ci. For more information about why unicode is better than general, see http://stackoverflow.com/questions/766809/whats-the-difference-between-utf8-general-ci-and-utf8-unicode-ci – Aistis Aug 07 '14 at 07:56
0

Use utf8mb4 instead of utf8

utf8mb4_general_ci => supports characters of 1, 2, 3 or 4 bytes

and

utf8_general_ci or utf8mb3_general_ci => supports characters of 1, 2 or 3 bytes

Each character only takes as much space on disk as it requires.

Deepak Kumar
0

Using utf8mb4_unicode_ci or utf8mb4_general_ci can be tricky and cause unexpected behavior.

Be aware.

Perhaps utf8mb4_bin (the binary collation for utf8mb4; there is no utf8mb4_unicode_bin) can be a good option if you want to avoid cases like the one in the screenshot below.

[Image not preserved]
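One well-known kind of surprise with `_ci` collations is that strings differing only in case or accents compare as equal, while a binary collation treats them as different. The Python sketch below is only an analogy for that idea: `loose_equal` is a hypothetical stand-in, not MySQL's actual comparison algorithm, and it is not necessarily the case the answer had in mind.

```python
import unicodedata


def binary_equal(a: str, b: str) -> bool:
    """Compare the raw UTF-8 bytes, roughly what a _bin collation does."""
    return a.encode("utf-8") == b.encode("utf-8")


def loose_equal(a: str, b: str) -> bool:
    """Rough stand-in for a case- and accent-insensitive comparison.

    Decomposes each string, drops combining marks, then case-folds.
    This is NOT how MySQL implements utf8mb4_unicode_ci; it only
    illustrates why two visibly different strings can be 'equal'.
    """
    def simplify(s: str) -> str:
        decomposed = unicodedata.normalize("NFKD", s)
        stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
        return stripped.casefold()

    return simplify(a) == simplify(b)


word_a = "R\u00e9sum\u00e9"  # "Resume" written with accented e's
word_b = "resume"
print("loose comparison: ", loose_equal(word_a, word_b))
print("binary comparison:", binary_equal(word_a, word_b))
```

If that looseness is a problem for a column (a unique username, say), a binary collation on that column avoids it, at the cost of case-sensitive matching and sorting by code point.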

FabianoLothor
0

From the MySQL website:

utf8mb4: A UTF-8 encoding of the Unicode character set using one to four bytes per character.

utf8mb3: A UTF-8 encoding of the Unicode character set using one to three bytes per character. This character set is deprecated in MySQL 8.0, and you should use utf8mb4 instead.

utf8: An alias for utf8mb3. In MySQL 8.0, this alias is deprecated; use utf8mb4 instead. utf8 is expected in a future release to become an alias for utf8mb4.

So prefer utf8mb4.
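As a rough sketch of what switching to utf8mb4 looks like in practice, the Python helpers below build the relevant MySQL statements as strings. The database name `my_site`, the table name `articles`, and the helper function names are hypothetical. The collation `utf8mb4_unicode_ci` is an assumption borrowed from the top answer, not something this answer specifies. Back up your data before running statements like these on a real table.

```python
CHARSET = "utf8mb4"
COLLATION = "utf8mb4_unicode_ci"  # assumption: the collation suggested in the top answer


def quote_identifier(name: str) -> str:
    """Backtick-quote a MySQL identifier, doubling any embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def alter_database_sql(database: str) -> str:
    """Build a statement that sets the default charset/collation for new tables."""
    return (
        f"ALTER DATABASE {quote_identifier(database)} "
        f"CHARACTER SET {CHARSET} COLLATE {COLLATION};"
    )


def convert_table_sql(table: str) -> str:
    """Build a statement that converts an existing table and its text columns."""
    return (
        f"ALTER TABLE {quote_identifier(table)} "
        f"CONVERT TO CHARACTER SET {CHARSET} COLLATE {COLLATION};"
    )


# Hypothetical names, for illustration only.
print(alter_database_sql("my_site"))
print(convert_table_sql("articles"))
```

The client connection also has to use utf8mb4, otherwise 4-byte characters can still be mangled on the way in or out.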

Suresh