4

I have a column called username, i want the user to be able to insert text in japanese, roman, arabic, korean, and everything that is possible, including special chars [https://en.wiktionary.org/wiki/Index:All_languages], what COLLATE should i set on my database and tables?

I'm using utf_general_ci, i'm new so i don't know if this is the best COLLATE for my needs. I need to choose the right COLLATE to avoid sql error, because i will not use preg_replace or a function to replace special chars, i will only use prepared statement to avoid SLQ injection and protect by database.

mind set
  • 133
  • 10
  • To use *actual* UTF-8 in MySQL you can pick any collation that shows up when you run `SHOW COLLATION LIKE 'utf8mb4%'` in your server. But one thing is charset and another thing is collation. For that, you need to figure out your preferred [case/accent sensitivity](https://dev.mysql.com/doc/refman/8.0/en/charset-collation-names.html), as well as the advanced rules for character comparison and sorting. – Álvaro González Jan 04 '19 at 18:10

3 Answers3

5
  • First choice (MySQL 8.0): utf8mb4_0900_ai_ci
  • Second choice (as of 5.6): utf8mb4_unicode_520_ci
  • Third choice (5.5+): utf8mb4_unicode_ci
  • Before 5.5, you can't handle all of Chinese, nor Emoji: utf8_unicode_ci

The numbers refer to Unicode standards 9.0, 5.20, and (no number) 4.0.

No collation is good for sorting all languages at the same time. Spanish, German, Turkish, etc, have quirks that are incompatible. The collations above are the 'best' general purpose ones available.

utf8mb4 handles all characters yet specified by Unicode (including Cherokee, Klingon, Cuneiform, Byzantine, etc.)

If Portuguese is the focus:

See https://pt.stackoverflow.com/ and MySQL collation for Portugese .

Study this for 8.0 or this for pre 8.0 to see which utf8/utf8mb4 collation comes closest to sorting Portuguese 'correctly'. Perhaps utf8mb4_danish_ci or utf8mb4_de_pb_0900_ai_ci would be best.

(Else go with the 'choices' listed above.)

Rick James
  • 135,179
  • 13
  • 127
  • 222
  • interesting, i will seach a bit about `utf8mb4_0900_ai_ci`. And my mysql version is `10.1.37`. – mind set Jan 05 '19 at 01:36
  • 1
    MariaDB 10.1 approximates MySQL 5.6, so the 'best' for you is `utf8mb4_unicode_520_ci`. (I just now checked 10.1.33.) – Rick James Jan 05 '19 at 02:18
  • I have 2 questions, 1 - how do you choose the best `COLLATE` for a MySQL version? I made some reserch and i will use `utf8mb4_unicode_520_ci` and i will see if its good. 2 - Should i add `setlocale(LC_TIME, 'pt_BR', 'pt_BR.utf-8', 'pt_BR.utf-8', 'portuguese'); //ppl said that i should use the language of my country` on top of my file? I think it could help to replace special chars, but iam not sure, i just meet this function some days ago – mind set Jan 05 '19 at 10:43
  • Actually i asked if it's recommended to use `setLocale()`, i mean if there's one words that a `COLLATE` can't handle, it will print out on the screen a sql error, so to avoid this error `setLocale()` maybe could help (i'm not sure about it). But anyways it's a topic for another question, thanks – mind set Jan 06 '19 at 01:17
  • @user236945896 - Screen rendering is a different issue. What OS, what 'screen'? – Rick James Jan 06 '19 at 01:27
  • This only could happens on a `SELECT` query that the user have permition to modify the values on the `WHERE` clause. Example: if your `COLLATE` is `latin1_swedish_ci` and you have a query like this: `SELECT * FROM table WHERE column = '壱げ'` (this is just a short example, i use prepared statements with placeholders), this will print out on the screen a error https://pastebin.com/YAbEHWgm this happens because `壱` is not supported by `latin1_swedish_ci`. – mind set Jan 06 '19 at 01:58
  • This is why i made this question, even though `Illegal mix of collations` error does not print out any private info, i wanted to avoid this kind of error. I will choose one of the `COLLATE`s that you listed above. – mind set Jan 06 '19 at 01:58
  • @user236945896 - Either _connection_ parameters or the column definition is saying latin1. Show us both. Plan on moving everything to utf8mb4, away from latin1. – Rick James Jan 06 '19 at 02:01
  • Let me just add that for Portuguese characters the correct collation is latin1_general_ci – FilT Jun 03 '20 at 21:48
  • @FilT - What are the extra characters and rules for sorting Portuguese? – Rick James Jun 03 '20 at 21:53
  • Sorry, meant to add that in case you are using pre 8.0 the best one is latin1_general_ci. Otherwise you do not have utf8mb4_de_pb_0900_ai_ci available which might indeed be the best – FilT Jun 03 '20 at 21:55
  • MariaDB 10.10 -- and even newer Collation. – Rick James Jan 30 '23 at 05:28
3

If you are using MySQL 5.5.3 or higher, I would recommend UTF-8 character encoding utf8mb4_unicode_ci . AFAIK it supports most, if not all languages, and implements the Unicode standard for sorting and comparison. As a second choice, have a look at utf8mb4_general_ci, which may be faster but also less accurate.

See this excellent SO post for (many) more details, or check out the official MySQL doc.

Below 5.5.3, utf8_unicode_ci is your friend.

GMB
  • 216,147
  • 25
  • 84
  • 135
0

COLLATION refers to ordering (as in comparisons in WHERE and ORDER BY); you should really ask about CHARACTER SET:

  • Pre-5.5.3: utf8 (aka utf8mb3) handles all languages, except for a few Chinese characters and Emoji.
  • 5.5.3 forward: utf8mb4 - Handles everything. Outside of MySQL, it is spelled "UTF-8".
Rick James
  • 135,179
  • 13
  • 127
  • 222