
All the MySQL tables in my PHP web application are MyISAM with utf8 encoding. Since records can be generated from a companion app while it's offline, my table keys are randomly generated alphanumeric VARCHARs; those key fields use the utf8_bin collation so that comparisons are case-sensitive.

I recently decided to change the encoding of all my text fields to support the emojis that some users like to enter. I changed all utf8 fields to utf8mb4, including the keys, and immediately started seeing performance issues: complex SELECT queries on one of the larger tables took more than a minute, and other queries queued up waiting for table locks. I changed the encoding of the primary key field on that table back to utf8, and performance returned to normal. A couple of days later I changed that one field back to utf8mb4, the queries started queueing up again, and I reverted it to restore normal performance.

So this runs smoothly:

`ID` varchar(8) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL DEFAULT ''

But this causes problems:

`ID` varchar(8) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL DEFAULT ''
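For reference, I flip between the two encodings with a statement like this (the table name is just a placeholder for the affected table):

-- 'records' is a placeholder; the real table is one of the larger ones mentioned above
ALTER TABLE records
  MODIFY `ID` varchar(8) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL DEFAULT '';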

Everything I've read says utf8 and utf8mb4 should have equivalent performance, but I'm seeing a distinct difference in my case. Does this make sense?

It's not really a problem to keep the key fields at utf8, since I don't foresee ever needing more than simple alphanumeric characters there. But I would have liked all the fields to use the same encoding, just for consistency and ease of maintenance (no need to remember that user-populated fields use one encoding and key fields another).

Regarding the comment from @MandyShaw:

When I work with the database in the Sequel Pro Mac app, the console constantly shows pairs of SET NAMES 'utf8' and SET NAMES 'utf8mb4' entries, so that does suggest that not everything is set correctly. However, here's what I have currently:

MySQL [(none)]> SHOW GLOBAL VARIABLES WHERE Variable_name LIKE 'character\_set\_%' OR Variable_name LIKE 'collation%';
+--------------------------+--------------------+
| Variable_name            | Value              |
+--------------------------+--------------------+
| character_set_client     | utf8mb4            |
| character_set_connection | utf8mb4            |
| character_set_database   | utf8mb4            |
| character_set_filesystem | binary             |
| character_set_results    | utf8mb4            |
| character_set_server     | utf8mb4            |
| character_set_system     | utf8               |
| collation_connection     | utf8mb4_unicode_ci |
| collation_database       | utf8mb4_unicode_ci |
| collation_server         | utf8mb4_unicode_ci |
+--------------------------+--------------------+

I read that character_set_system can't be changed from utf8 and character_set_filesystem should be binary.

Sequel Pro's connection encoding was set to Autodetect, but when I change it explicitly to utf8mb4, then open a new connection, I still see all those encoding changes in the console.
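For what it's worth, the per-connection values can be checked the same way at session scope, since SET NAMES only changes session values (the globals above are just the defaults a new connection starts from):

-- run immediately after opening a new connection
SHOW SESSION VARIABLES
WHERE Variable_name LIKE 'character\_set\_%'
   OR Variable_name LIKE 'collation%';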

Is there something else I need to change to use this encoding consistently?

arlomedia
  • Could you provide some benchmarks? Which queries exactly are you executing that give you problems? And what encoding do you use in your _connection_? – Franz Gleichmann Jun 14 '18 at 05:51
  • 1
    It sounds to me as if there is conversion going on somewhere (a specific character set performing the same as another /when all processing uses that character set/ is one thing, but processing that involves two character sets is quite another). I wonder if you have something in the database that is hardcoded to utf8 but that you didn't spot in your trawl through. – MandyShaw Jun 14 '18 at 06:01
  • @arlomedia Did you ever find a solution? I'm seeing the same slow performance after converting... – rubberchicken Jul 18 '18 at 15:59
  • @rubberchicken I just changed the key fields back to utf8 until I have time to troubleshoot more. – arlomedia Jul 20 '18 at 17:43
  • Same here... let me know if you figure it out and I'll do the same if I do – rubberchicken Jul 21 '18 at 00:50

1 Answer


utf8 is really utf8mb3 and may use at most 3 bytes per character, while utf8mb4 may use up to 4 bytes per character. For VARCHAR columns this does not normally make much difference, since MySQL stores only as many bytes as needed (unless you have created your MyISAM tables with ROW_FORMAT=FIXED).
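Where the extra byte does matter is in fixed-width storage, which reserves the worst case for every character. For the VARCHAR(8) key above, a quick back-of-the-envelope:

-- Fixed-width row formats reserve max-bytes-per-character for each character:
--   utf8 (utf8mb3): 3 bytes/char * 8 chars = 24 bytes per key
--   utf8mb4:        4 bytes/char * 8 chars = 32 bytes per key
-- The stored data itself does not grow; ASCII still encodes to 1 byte/char:
SELECT LENGTH(CONVERT('ABCD1234' USING utf8mb4));  -- 8 bytes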

However, during query execution, MySQL may create temporary tables in the MEMORY storage engine, which does not support variable-length rows. These temporary tables have a maximum size, and if that size is exceeded, they are converted to on-disk tables in MyISAM or InnoDB (depending on your version of MySQL). The status variable Created_tmp_disk_tables is incremented each time this happens. If you see it growing, try increasing the values of max_heap_table_size and tmp_table_size.
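A minimal way to check for this and to raise the limits (the sizes here are illustrative, not recommendations):

-- how many internal temporary tables have spilled to disk so far
SHOW GLOBAL STATUS LIKE 'Created_tmp%';

-- MySQL uses the smaller of these two caps for in-memory temporary tables
SET GLOBAL max_heap_table_size = 268435456;  -- 256M, example value
SET GLOBAL tmp_table_size      = 268435456;  -- 256M, example value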

Alternatively, upgrade to MySQL 8.0, where a new storage engine (TempTable) that supports variable-length rows is used for internal temporary tables.
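On 8.0 you can confirm which engine backs in-memory internal temporary tables:

SHOW VARIABLES LIKE 'internal_tmp_mem_storage_engine';  -- TempTable by default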

Øystein Grøvlen
  • An aside: 8.0 has much faster collation code, though it may not be any faster for the trivial `_bin` collations. – Rick James Jun 16 '18 at 19:49
  • If you don't need more than ascii's "alpha" characters, then use ascii_bin. This will shrink, rather than increase, the tmp table space. – Rick James Jun 16 '18 at 19:51
  • I don't see a way to check created_tmp_disk_tables through my cloud database provider, but I'm pretty sure the temp tables in memory are not being created on disk, because I'm not seeing my disk usage changing. But would the size of the index still affect performance even when working in memory? If so, changing the key fields to ascii would be a good idea. – arlomedia Jul 20 '18 at 18:01
  • If performance schema is available, you may also see this information in the events_statements_history table. (There is a column for created_tmp_disk_tables.) Pages of InnoDB temporary tables will be cached in the InnoDB buffer pool if there is space. Hence, you will not necessarily see any disk activity. Still, there is more overhead to such tables than to pure memory tables. – Øystein Grøvlen Jul 24 '18 at 07:06
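To make the performance schema check above concrete, a query along these lines lists recent statements that spilled temporary tables to disk (requires performance_schema to be enabled):

SELECT EVENT_ID, SQL_TEXT, CREATED_TMP_TABLES, CREATED_TMP_DISK_TABLES
FROM performance_schema.events_statements_history
WHERE CREATED_TMP_DISK_TABLES > 0
ORDER BY EVENT_ID DESC
LIMIT 10;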