
First, I would like to assure you that I have done my "homework" and have read this, this, this and this. Also, one of my former questions is closely related to this one, but that question deals with flourishlib's compatibility issues with utf8mb4, whereas this question goes a level deeper. Suppose I have several tables and I want to convert just a few columns to utf8mb4 encoding, to preserve storage space and performance after the change. If I converted the whole database to utf8mb4, its size would increase by 33%, which would hurt its performance as well. So, we have selected four columns from three different tables that need to support emojis. These are:

  • users.bio (tinytext, utf8_general_ci)
  • questions.question (longtext, utf8_general_ci)
  • questions.answer (longtext, utf8_general_ci)
  • comments.comment (tinytext, utf8_general_ci)

As a result, my action plan is as follows:

  1. Create a backup of the database

  2. Run these commands:

alter table comments change comment comment tinytext character set utf8mb4 collate utf8mb4_unicode_ci;
alter table users change bio bio tinytext character set utf8mb4 collate utf8mb4_unicode_ci;
alter table questions change question question longtext character set utf8mb4 collate utf8mb4_unicode_ci;
alter table questions change answer answer longtext character set utf8mb4 collate utf8mb4_unicode_ci;
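
After running the ALTERs, the result can be checked against information_schema; a sketch, where 'mydb' is a placeholder for the actual schema name:

```sql
-- Verify that only the four intended columns were converted
-- (replace 'mydb' with your actual database name).
SELECT table_name, column_name, character_set_name, collation_name
FROM information_schema.columns
WHERE table_schema = 'mydb'
  AND table_name IN ('users', 'questions', 'comments')
ORDER BY table_name, column_name;
```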

Expectations:

  • this should make the specified columns use utf8mb4 instead of utf8
  • existing data will be correctly converted to utf8mb4, that is, previous texts will be preserved and users will still be able to read their content correctly
  • other columns will not be changed
  • queries involving the affected tables will be slower

Are my expectations accurate? Do I need to change the connection? Thanks

Lajos Arpad

1 Answer


You need utf8mb4 in any columns that are storing Chinese.

In VARCHAR(...) utf8mb4, each "character" takes 1-4 bytes. No 33% increase. On the other hand, CHAR(10) utf8mb4 is always allocated 40 bytes.
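
The per-character byte counts are easy to check from any client connected with utf8mb4; a small sketch:

```sql
-- LENGTH() counts bytes, CHAR_LENGTH() counts characters.
-- ASCII text takes 1 byte per character in both utf8 and utf8mb4:
SELECT LENGTH('abc'), CHAR_LENGTH('abc');   -- 3, 3
-- An emoji is outside the BMP and needs 4 bytes, hence utf8mb4:
SELECT LENGTH('😀'), CHAR_LENGTH('😀');     -- 4, 1
```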

You do need to establish that your client is talking utf8mb4, not just utf8. That comes in some parameter in the connection or with SET NAMES utf8mb4.
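
Issued immediately after connecting, that looks like:

```sql
-- Tell the server that this client sends and expects utf8mb4.
-- (Shorthand for setting character_set_client, character_set_connection
-- and character_set_results in one statement.)
SET NAMES utf8mb4 COLLATE utf8mb4_unicode_ci;
```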

If you need to automate the ALTERs, it is pretty easy to generate them via a SELECT against information_schema.
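
A sketch of such a generator; the schema name 'mydb' is a placeholder, and the WHERE clause should be adjusted to pick only the columns you actually want to convert:

```sql
-- Emit one ALTER statement per column still using utf8.
SELECT CONCAT('ALTER TABLE `', table_name, '` MODIFY `', column_name,
              '` ', column_type,
              ' CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;') AS ddl
FROM information_schema.columns
WHERE table_schema = 'mydb'            -- placeholder schema name
  AND character_set_name = 'utf8';
```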

Addenda

Expectations 1-3: Yes.

Expectation 4 (queries involving the affected tables will be slower) -- processing will be essentially the same speed.

Rick James
  • Rick, I have no business with Chinese characters. As I have described in my question, we are talking about emojis. Also, your remark about varchar's size not increasing is invalid. Imagine a 10-character-long varchar stored in utf8: that would take 30 bytes. Now imagine a 10-character-long varchar stored in utf8mb4: that would take 40 bytes, a 33% increase. Also, if you read the question again, you will see that the types involved are tinytext and longtext. I need 4 alters, no need to automate them. – Lajos Arpad Sep 08 '15 at 07:14
  • Your answer does not really address the question. I would really like to accept it, but you need to make edits to make sure it addresses the question. In the meantime I have resolved the issue, so I will either accept an answer or write my own. – Lajos Arpad Sep 08 '15 at 07:16
  • It seems that your argument about varchars is valid, but it does not affect this question where there is no varchar involved. – Lajos Arpad Sep 08 '15 at 07:29
  • Text _without_ emojis will take exactly the same space in `VARCHAR` or `TEXT` for `utf8` as for `utf8mb4`. (And I answered the 'expectations') – Rick James Sep 08 '15 at 17:59
  • OK, the Addenda made your answer THE answer, thanks. I will not write a separate answer here. – Lajos Arpad Sep 08 '15 at 19:46