
I noticed that my schema can't print some strings with regional characters, such as è, ù and other accents or symbols.

The manager app is a Java servlet that uses the JDBC driver, and it has no such problems. So I thought the cause could be the default collation: utf8 - utf8_general_ci.

After some research I concluded that these characters can't be stored within utf8's bytes. Should I use utf8mb4, utf-16, utf-32, or something else? Which is the smallest one that supports all European characters (no Cyrillic, Arabic, or Asian scripts)?

For example, this chosen answer suggests utf8mb4_unicode, but I can't tell whether it's really the minimal option that covers all the characters I need.

What's the difference between utf8_general_ci and utf8_unicode_ci

user3290180
  • A collation is used to sort/compare strings. An encoding/charset is used to encode characters into bytes. UTF-8 is an encoding, and it supports every possible Unicode character. Your research seems to have led you to an incorrect conclusion. You should show precisely, with code, what you're doing, what you expect to happen, and what happens instead. – JB Nizet Jun 19 '16 at 11:36
  • utf8 and utf16 basically cover the same characters (including your accented symbols); only the encoding differs. utf8 doesn't mean "8 bit" but "a minimum of 8 bits". utf8mb4 in MySQL adds the 4th byte of standard UTF-8; you should use it for compatibility (it will use fewer bytes if you don't need them all). The `_unicode_ci` or `_general_ci` suffix only affects sorting (you should use `_unicode_ci`). You probably have an encoding problem somewhere in your chain (strings in Java are UTF-16; you might have to set the correct UTF-8 encoding in your driver/import/bytestream/output client). – Solarflare Jun 19 '16 at 12:01

1 Answer


Use CHARACTER SET utf8 or utf8mb4 for the encoding. MySQL's utf8 stores at most 3 bytes per character, which covers all of Europe and most of the rest of the world. utf8mb4 allows up to 4 bytes per character and covers all the world's languages, plus emoji. utf8 is a subset of utf8mb4.
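To see which characters need which charset, it can help to count the UTF-8 bytes per character. The following is a small sketch in plain Java, with no database involved; the class and method names (`Utf8Width`, `utf8Bytes`, `fitsInMysqlUtf8`) are made up for illustration. Anything that encodes to 3 bytes or fewer fits in MySQL's utf8; anything that needs 4 bytes requires utf8mb4.

```java
import java.nio.charset.StandardCharsets;

class Utf8Width {
    // Number of bytes that the UTF-8 encoding of one Unicode code point needs.
    static int utf8Bytes(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // True if every code point in s needs at most 3 UTF-8 bytes,
    // which is the limit of MySQL's utf8 charset (utf8mb4 allows 4).
    static boolean fitsInMysqlUtf8(String s) {
        return s.codePoints().allMatch(cp -> utf8Bytes(cp) <= 3);
    }

    public static void main(String[] args) {
        // Unicode escapes keep the example independent of the source file encoding.
        int[] samples = {
            0x00E8,  // è
            0x00F9,  // ù
            0x20AC,  // € (euro sign)
            0x1F600  // an emoji, outside the Basic Multilingual Plane
        };
        for (int cp : samples) {
            System.out.printf("U+%04X needs %d byte(s)%n", cp, utf8Bytes(cp));
        }
    }
}
```

The accented letters from the question take 2 bytes each and the euro sign takes 3, so all of them fit in either charset. Only characters like the emoji, at 4 bytes, need utf8mb4.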

One can use different COLLATIONs depending on the ordering you desire. Spanish, for example, (with utf8_spanish2_ci or utf8mb4_spanish2_ci) plays games with ll that other languages do not. utf8_latvian_ci treats Ķ as a different character than K; others do not.

If you are not worried about detailed language differences, then I recommend ..._general_ci, ..._unicode_ci, or ..._unicode_520_ci if your version of MySQL is recent enough to have it. These three collations primarily differ as follows:

  • general: Comparisons test only one character at a time, so ll cannot be treated as a separate letter. This one is slightly faster.
  • unicode: This is derived from an older Unicode standard (the Unicode Collation Algorithm 4.0.0). This handles "combining" accents 'correctly'.
  • unicode_520: This is based on a newer standard (UCA 5.2.0). Emojis are treated as distinct.
  • unicode_...: More may come in later versions of MySQL.
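As a rough illustration of what a "combining" accent is, here is a plain Java sketch. It is not how MySQL implements its collations; it only shows that the same visible letter can be stored either as one code point or as a base letter plus a combining mark, which a comparison that looks at one character at a time cannot reconcile. The class and method names are hypothetical.

```java
import java.text.Normalizer;

class CombiningAccents {
    // Compose "base letter + combining mark" sequences into single code points (NFC).
    static String compose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // Naive comparison: code unit by code unit, with no knowledge of combining marks.
    static boolean sameUnitByUnit(String a, String b) {
        return a.equals(b);
    }

    public static void main(String[] args) {
        String precomposed = "\u00e8"; // è stored as one code point
        String combining = "e\u0300";  // e followed by COMBINING GRAVE ACCENT

        // The two strings look identical on screen but differ unit by unit.
        System.out.println(sameUnitByUnit(precomposed, combining)); // false

        // After composing both sides, they compare equal.
        System.out.println(sameUnitByUnit(compose(precomposed), compose(combining))); // true
    }
}
```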

Regardless of what your application does, you must:

  • Tell MySQL what encoding the client has, for example by appending ?useUnicode=yes&characterEncoding=UTF-8 to the JDBC connection URL.
  • Establish CHARACTER SET utf8 (or utf8mb4) on each column or table.
  • If you are using web pages, set charset=UTF-8 in the meta tag.
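
A minimal sketch of the first two steps from a Java client, assuming a MySQL JDBC driver is on the classpath. The host, database, credentials, and table name are placeholders, and the class and method names are invented for this example. The URL parameters are the ones listed above; the `ALTER TABLE ... CONVERT TO CHARACTER SET` statement is one way to change an existing table, so try it on a copy of your data first.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

class Utf8Connection {
    // Build a JDBC URL that tells the driver the client side uses UTF-8.
    // host and database are placeholders supplied by the caller.
    static String jdbcUrl(String host, String database) {
        return "jdbc:mysql://" + host + "/" + database
                + "?useUnicode=yes&characterEncoding=UTF-8";
    }

    // Convert an existing table (and its text columns) to utf8mb4.
    // The table name is concatenated into the SQL, so it must come from
    // trusted code, never from user input.
    static void convertTable(String url, String user, String password, String table)
            throws SQLException {
        try (Connection conn = DriverManager.getConnection(url, user, password);
             Statement stmt = conn.createStatement()) {
            stmt.executeUpdate("ALTER TABLE " + table
                    + " CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci");
        }
    }

    public static void main(String[] args) {
        // Only prints the URL; calling convertTable needs a running MySQL server.
        System.out.println(jdbcUrl("localhost", "mydb"));
    }
}
```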
Rick James