27

I want a UTF-8 collation that supports:

  • English
  • Persian
  • Arabic
  • French
  • Japanese
  • Chinese

Does UTF8_GENERAL_CI support all of these languages?

Oki Erie Rinaldi
armin etemadi
  • -1 There is no one best answer! 'Collation' is sorting, not collecting. Each of these languages must be sorted appropriately for that language. While there might be several ways to sort French for example, the sorting of French is not better than the sorting of Chinese. It's like which is better an apple or an orange? There is no best answer to that. – Elliptical view Jun 10 '20 at 22:10

2 Answers

41

Yes, that is correct. UTF-8 is an encoding for the Unicode character set, which supports pretty much every language in the world.

I think the only difference is in how your results are sorted: letters may come in a different order in other languages (accents, umlauts, etc.). Also, comparing a to ä might behave differently under another collation.

The _ci suffix means that sorting and comparison are case-insensitive.
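
To get a feel for what a case-insensitive comparison does, and for why a and ä may or may not compare as equal, here is a minimal Python sketch. It is not MySQL's collation algorithm: `loose_key` and `loose_equal` are hypothetical helper names, and the approach (case folding plus stripping combining marks) only approximates the idea.

```python
import unicodedata


def loose_key(text: str) -> str:
    """Hypothetical comparison key that ignores case and combining accents."""
    # Decompose characters so that 'ä' becomes 'a' + a combining diaeresis.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (Unicode category Mn), then fold case.
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return stripped.casefold()


def loose_equal(a: str, b: str) -> bool:
    """Compare two strings using the loose key instead of raw code points."""
    return loose_key(a) == loose_key(b)


print(loose_equal("A", "a"))   # case is ignored by the loose key
print(loose_equal("a", "ä"))   # the accent is ignored by the loose key
print("a" == "ä")              # a plain code point comparison still differs

# Sorting with the loose key keeps 'ä' next to 'a' instead of after 'z'.
print(sorted(["b", "ä", "a", "B"], key=loose_key))
```

A real collation applies language-aware rules on top of this, which is why the choice of collation matters once you sort Persian, Arabic, Japanese or Chinese text.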

http://www.collation-charts.org/ might be of interest to you.

knittl
  • 2
    Thank you so much buddy :) One more question: do you mean that if I select the utf8_general collation, it will be a problem to sort my records in both English and Persian or other languages? – armin etemadi Apr 24 '10 at 08:06
  • 2
    english and french should sort pretty much the same, i don't know about the other ones (persian, arabic, japanese, chinese), because they don't use the normal english characters. how do you like them to sort? after english letters, inbetween, before? know what i mean? – knittl Apr 24 '10 at 08:16
  • No, I mean sorting Persian characters in their own order, the same way English letters are sorted A, B, C, ... Will this collation get that wrong? – armin etemadi Apr 24 '10 at 10:51
  • 1
    collations can be changed after database/table creation, so it shouldn't really be a problem to select a different one if the sorting goes wrong. but i guess it will work the way you intend it to – knittl Apr 24 '10 at 11:33
  • 1
    @knittl: *collation* is **always** about sorting, so your answer is bypassing the question a bit, which is a pity as the question is easy to find on Google ... :/ – hakre Jul 25 '13 at 12:46
  • @hakre: the question is about which collation supports all of the given languages. My answer also states that sorting behaves differently depending on the collation. What are you missing specifically in my answer? – knittl Jul 25 '13 at 13:45
  • An explanation of why that collation is the best choice for the languages given in the question. That is not clear from your answer, which talks about encoding first although the question is about collation. – hakre Jul 25 '13 at 14:02
  • 1
    "The _ci suffix means sorting and comparison happens case insensitive." Thanks very much for this. – felwithe Mar 08 '20 at 17:55
8

UTF8_GENERAL_CI was a good choice some time ago, but it has some drawbacks now.

MySQL's utf8 charset stores at most 3 bytes per character, not the 4 bytes that UTF-8 needs for symbols such as emoji and some newer Asian characters.

So MySQL has a newer charset called utf8mb4, which actually complies with the UTF-8 definition.

To fully support Asian languages, you will need to choose utf8mb4.
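
To see which text is affected by the 3-byte limit, you can check how many bytes each character needs in UTF-8. The following is a minimal Python sketch, independent of MySQL; the helper names `utf8_length` and `fits_in_three_bytes` and the sample strings are only illustrative assumptions.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes the given character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


def fits_in_three_bytes(text: str) -> bool:
    """True if every character in text encodes to at most 3 UTF-8 bytes."""
    return all(utf8_length(ch) <= 3 for ch in text)


# Illustrative samples; the last two lie outside the Basic Multilingual Plane.
samples = {
    "English": "hello",
    "Persian": "سلام",
    "French": "été",
    "Japanese": "日本語",
    "Emoji": "\U0001F600",            # grinning face
    "CJK Extension B": "\U00020000",  # a rarer Han character
}

for label, text in samples.items():
    # Print the per-character byte counts and whether 3 bytes are enough.
    print(label, [utf8_length(ch) for ch in text], fits_in_three_bytes(text))
```

Common Persian, Arabic, French, Japanese and Chinese text fits in 3 bytes per character; it is the characters beyond that range, such as emoji and rarer Han characters, that need utf8mb4.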

If you care about correct sorting in multiple languages, use utf8mb4_unicode_ci instead of utf8mb4_general_ci.

You can find a more detailed answer in "What's the difference between utf8_general_ci and utf8_unicode_ci".

Aistis