It is best to use character set utf8mb4
with the collation utf8mb4_unicode_ci
.
The character set, utf8
, only supports a small amount of UTF-8 code points, about 6% of possible characters. utf8
only supports the Basic Multilingual Plane (BMP). There 16 other planes. Each plane contains 65,536 characters. utf8mb4
supports all 17 planes.
MySQL will truncate 4 byte UTF-8 characters resulting in corrupted data.
The utf8mb4
character set was introduced in MySQL 5.5.3 on 2010-03-24.
Some of the required changes to use the new character set are not trivial:
- Changes may need to be made in your application database adapter.
- Changes will need to be made to my.cnf, including setting the character set, the collation and switching innodb_file_format to Barracuda
- SQL CREATE statements may need to include:
ROW_FORMAT=DYNAMIC
- DYNAMIC is required for indexes on VARCHAR(192) and larger.
NOTE: Switching to Barracuda
from Antelope
, may require restarting the MySQL service more than once. innodb_file_format_max
does not change until after the MySQL service has been restarted to: innodb_file_format = barracuda
.
MySQL uses the old Antelope
InnoDB file format. Barracuda
supports dynamic row formats, which you will need if you do not want to hit the SQL errors for creating indexes and keys after you switch to the charset: utf8mb4
- #1709 - Index column size too large. The maximum column size is 767 bytes.
- #1071 - Specified key was too long; max key length is 767 bytes
The following scenario has been tested on MySQL 5.6.17:
By default, MySQL is configured like this:
SHOW VARIABLES;
innodb_large_prefix = OFF
innodb_file_format = Antelope
Stop your MySQL service and add the options to your existing my.cnf:
[client]
default-character-set= utf8mb4
[mysqld]
explicit_defaults_for_timestamp = true
innodb_large_prefix = true
innodb_file_format = barracuda
innodb_file_format_max = barracuda
innodb_file_per_table = true
# Character collation
character_set_server=utf8mb4
collation_server=utf8mb4_unicode_ci
Example SQL CREATE statement:
CREATE TABLE Contacts (
id INT AUTO_INCREMENT NOT NULL,
ownerId INT DEFAULT NULL,
created timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
modified timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
contact VARCHAR(640) NOT NULL,
prefix VARCHAR(128) NOT NULL,
first VARCHAR(128) NOT NULL,
middle VARCHAR(128) NOT NULL,
last VARCHAR(128) NOT NULL,
suffix VARCHAR(128) NOT NULL,
notes MEDIUMTEXT NOT NULL,
INDEX IDX_CA367725E05EFD25 (ownerId),
INDEX created (created),
INDEX modified_idx (modified),
INDEX contact_idx (contact),
PRIMARY KEY(id)
) DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci ENGINE = InnoDB ROW_FORMAT=DYNAMIC;
- You can see error #1709 generated for
INDEX contact_idx (contact)
if ROW_FORMAT=DYNAMIC
is removed from the CREATE statement.
NOTE: Changing the index to limit to the first 128 characters on contact
eliminates the requirement for using Barracuda with ROW_FORMAT=DYNAMIC
INDEX contact_idx (contact(128)),
Also note: when it says the size of the field is VARCHAR(128)
, that is not 128 bytes. You can use have 128, 4 byte characters or 128, 1 byte characters.
This INSERT
statement should contain the 4 byte 'poo' character in the 2 row:
INSERT INTO `Contacts` (`id`, `ownerId`, `created`, `modified`, `contact`, `prefix`, `first`, `middle`, `last`, `suffix`, `notes`) VALUES
(1, NULL, '0000-00-00 00:00:00', '2014-08-25 03:00:36', '1234567890', '12345678901234567890', '1234567890123456789012345678901234567890', '1234567890123456789012345678901234567890', '12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678', '', ''),
(2, NULL, '0000-00-00 00:00:00', '2014-08-25 03:05:57', 'poo', '12345678901234567890', '', '', '', '', ''),
(3, NULL, '0000-00-00 00:00:00', '2014-08-25 03:05:57', 'poo', '12345678901234567890', '', '', '123', '', '');
You can see the amount of space used by the last
column:
mysql> SELECT BIT_LENGTH(`last`), CHAR_LENGTH(`last`) FROM `Contacts`;
+--------------------+---------------------+
| BIT_LENGTH(`last`) | CHAR_LENGTH(`last`) |
+--------------------+---------------------+
| 1024 | 128 | -- All characters are ASCII
| 4096 | 128 | -- All characters are 4 bytes
| 4024 | 128 | -- 3 characters are ASCII, 125 are 4 bytes
+--------------------+---------------------+
In your database adapter, you may want to set the charset and collation for your connection:
SET NAMES 'utf8mb4' COLLATE 'utf8mb4_unicode_ci'
In PHP, this would be set for: \PDO::MYSQL_ATTR_INIT_COMMAND
References: