24

Our column is currently collated to latin1_swedish_ci and special unicode characters are, obviously, getting stripped out. We want to be able to accept chars such as U+272A ✪, U+2764 ❤, (see this wikipedia article) etc. I'm leaning towards utf8_unicode_ci, would this collation handle these and other characters? I don't care about speed as this column isn't an index.

MySQL Version: 5.5.28-1

HellaMad
  • 5,294
  • 6
  • 31
  • 53
  • 2
    Mathias posts on here but I happen to just be here and have this bookmarked: http://mathiasbynens.be/notes/mysql-utf8mb4 – Alex Jan 15 '13 at 00:37

1 Answers1

35

The collation is the least of your worries, what you need to think about is the character set for the column/table/database. The collation (rules governing how data is compared and sorted) is just a corollary of that.

MySQL supports several Unicode character sets, utf8 and utf8mb4 being the most interesting. utf8 supports Unicode characters in the BMP, i.e. a subset of all of Unicode. utf8mb4, available since MySQL 5.5.3, supports all of Unicode.

The collation to be used with any of the Unicode encodings is most likely xxx_general_ci or xxx_unicode_ci. The former is a general sorting and comparison algorithm independent of language, the latter is a more complete language independent algorithm supporting more Unicode features (e.g. treating "ß" and "ss" as equivalent), but is therefore also slower.

See https://dev.mysql.com/doc/refman/5.5/en/charset-unicode-sets.html.

deceze
  • 510,633
  • 85
  • 743
  • 889
  • I ran this query: `ALTER TABLE mydb.mytable CHANGE mycolumn mycolumn VARCHAR(500) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;` and I'm still not getting the correct values, do I have to set the table and database charsets as well? – HellaMad Jan 17 '13 at 02:49
  • 1
    No, only columns have charsets, table and database charsets are only default settings if not otherwise specified per column. What does that mean, "you don't get the correct values"? How are you trying to get the correct values? You need to set the correct connection charset front to back, including the client connection which also needs to be set to `utf8mb4`. See [this article](http://kunststube.net/frontback) for in-depth instructions. – deceze Jan 17 '13 at 03:15
  • A Java application is updating the DB, and instead of special unicode characters I'm just seeing a `?` in their place. I don't develop the Java application, I'm guessing it must be escaping the characters itself. – HellaMad Jan 17 '13 at 06:17
  • 1
    That's terrifically vague. :) As laid out in the beforelinked article, encodings must be handled correctly front to back to preserve characters correctly. The problem could be anywhere in the chain *(web?) <-> Java <-> DB driver <-> DB*. If you're not responsible for the Java app, just make sure the database saves and returns data in the correct encoding. – deceze Jan 17 '13 at 06:21
  • I will relay this to the Java developer and see if that solves the problem. I'll give you the bounty though (in 4 hours, when I can award it), thanks for the help! – HellaMad Jan 17 '13 at 22:04
  • Not quite the same, but I figured I would mention this if anybody is using Javascript: I had a problem where a Javascript function was calling ``escape()`` on my strings, prior to handing them off to a PHP script that inserted them into the DB. This caused the strings to be encoded in such a way that accented chars showed up in the DB as, say, u0219. Once I changed ``escape()`` to ``encodeURI()``, the data got passed to the DB correctly. – kmoser Sep 08 '17 at 08:17