
I've been asked to enable emoji support for an app backed by a PHP API. The app is currently iPhone-only (I don't have one, but I'm assuming it has emoji on it?).

Anyway, I noticed that the database for some reason uses latin1_swedish_ci everywhere. Since I wasn't sure whether utf-8 could support the 4-byte characters required for the full emoji range, I started googling, but couldn't really get a full answer from the results.

So:

  1. To support emoji, do the charsets/collations need setting to utf8 in MySQL, or to utf8mb4?

  2. If the charset needs setting to utf8mb4, what is the difference between utf8 and utf8mb4 (utf8 supports up to 4 bytes anyway, doesn't it?). Does utf8mb4 force characters to be stored in a fixed-width 4-byte representation (presumably requiring 4x more storage space per character, even in the standard ASCII range, which would normally be 1 byte)?

  3. Can utf8 be compared to utf8mb4 in MySQL queries? What if I try to do a full-text search, or a WHERE clause comparing a utf8mb4 column against a utf8 column of another table?

  4. Does PHP support 4-byte strings without having to use a special library like mbstring? I.e. can I just assign $var = $_POST['text'] and do things like $emoji_var == 'xxxx', or do I have to literally change all strings in PHP to use mbstring and change all comparators etc.?

I'm just trying to work out how much work is involved in adding emoji support, and any caveats of doing so. Any help would be great.

Lee
  • Re 2.), reading: http://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8mb4.html UTF-8 in mySQL can store only characters from the Basic Multilingual Plane. – Pekka Sep 19 '13 at 11:33
  • Right, a bit strange how it only allows 3 bytes. I'm sure there's a technical reason beyond my comprehension for not supporting 4 bytes, as utf8 should according to the standard. So now I know the MySQL tables need setting to utf8mb4! So I guess just questions 3 and 4 remain. – Lee Sep 19 '13 at 11:36
  • Emojis seem to be three-byte sequences: http://www.grumdrig.com/emoji-list/ but I'm not sure whether they are in the Basic Multilingual Plane. The easiest might be to try: Copy & paste one of the characters here - http://en.wikipedia.org/wiki/Emoji try to store them in mySQL and see what happens – Pekka Sep 19 '13 at 11:39
  • Re 4), you'll have to use a special multibyte library if you're going to do things to the strings that could have a different outcome for single-byte and multi-byte strings - e.g. cutting the string at a specific character location. For just storing the data as-is, you should be fine. – Pekka Sep 19 '13 at 11:40
  • I did read a more condensed, to-the-point version of the emoji wiki, which mentioned that only some emoji are 3 bytes and that you need 4 bytes to support the full range. – Lee Sep 19 '13 at 11:41
  • Right, so any string manipulation code needs to use mbstring... Fantastic, this could end up being a big bill for the client :P – Lee Sep 19 '13 at 11:42
  • Please see http://stackoverflow.com/questions/8709892/mysql-throws-incorrect-string-value-error/8767381#8767381 and http://stackoverflow.com/questions/16858915/migrating-a-php-application-to-handle-utf-8/16862181#16862181 "this could end up being a big bill for the client" Changing the PHP code itself wouldn't be that big a task, unless they are dependent on the behaviour of the Swedish character set e.g. expecting particular characters at certain code points. – Danack Sep 21 '13 at 20:23
  • [Here](https://mathiasbynens.be/notes/mysql-utf8mb4) you can find everything regarding the transition from `utf8` to `utf8mb4` summed up in a single article. – Oliver Maksimovic Nov 24 '14 at 16:50
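
The byte-width points raised in the comments (UTF-8 is variable-width, only some emoji fit in 3 bytes, and cutting a string at a byte offset can split a character) can be checked with a small sketch. Python is used here purely for illustration, since the thread itself contains no code; the helper names `utf8_len` and `needs_utf8mb4` are hypothetical and not part of any library. In PHP the analogous distinction is between `strlen` (counts bytes) and `mb_strlen` (counts characters).

```python
def utf8_len(ch: str) -> int:
    """Return how many bytes the text occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


def needs_utf8mb4(text: str) -> bool:
    """Return True if any code point lies outside the Basic Multilingual
    Plane (above U+FFFF); such characters take 4 bytes in UTF-8."""
    return any(ord(ch) > 0xFFFF for ch in text)


# ASCII letter, accented letter, euro sign, a BMP emoji, a non-BMP emoji.
samples = ["A", "\u00E9", "\u20AC", "\u263A", "\U0001F600"]
for ch in samples:
    print(f"U+{ord(ch):04X}: {utf8_len(ch)} byte(s), "
          f"outside BMP: {needs_utf8mb4(ch)}")

# Truncating at a byte offset can cut a multi-byte character in half.
raw = "hi \U0001F600".encode("utf-8")   # 3 ASCII bytes + 4 emoji bytes
try:
    raw[:5].decode("utf-8")             # keeps only 2 of the 4 emoji bytes
    split_ok = True
except UnicodeDecodeError:
    split_ok = False
print("byte-offset truncation decodes cleanly:", split_ok)
```

ASCII text still costs one byte per character in UTF-8, so the encoding itself is not fixed-width. Only code points above U+FFFF need the fourth byte, which is what the linked MySQL documentation describes `utf8mb4` as adding over `utf8`.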
