13

I have some MySQL tables with utf8mb4 fields, and others with utf8.

It is safe to use utf8mb4 in the PDO connection string for all tables? Or do I have to convert everything to utf8mb4, or start two different PDO connections?


EDIT: The question is not "can I store 4-byte characters into utf8 columns?" We already know we can't, that doesn't depend on the connection, so if a column is utf8 it means it will not receive 4 bytes characters, for example country or currency codes, email addresses, usernames... where the input is validated by the application.

the_nuts
  • 5,634
  • 1
  • 36
  • 68

3 Answers3

16

This can be tested quite easily with the following script:

<?php

$pdo = new PDO('mysql:host=localhost;dbname=test', 'test', '');

$pdo->exec("
    drop table if exists utf8_test;
    create table utf8_test(
        conn varchar(50) collate ascii_bin,
        column_latin1  varchar(50) collate latin1_general_ci,
        column_utf8    varchar(50) collate utf8_unicode_ci,
        column_utf8mb4 varchar(50) collate utf8mb4_unicode_ci
    );
");

$latin = 'abc äŒé';
$utf8  = '♔♕';
$mb4   = ' ';

$pdo->exec("set names utf8");

$pdo->exec("
    insert into utf8_test(conn, column_latin1, column_utf8, column_utf8mb4)
     values ('utf8', '$latin', '$latin $utf8', '$latin $utf8 $mb4')
");

$pdo->exec("set names utf8mb4");

$pdo->exec("
    insert into utf8_test(conn, column_latin1, column_utf8, column_utf8mb4)
     values ('utf8mb4', '$latin', '$latin $utf8', '$latin $utf8 $mb4')
");

$result = $pdo->query('select * from utf8_test')->fetchAll(PDO::FETCH_ASSOC);

var_export($result);

And this is the result:

array (
  0 => 
  array (
    'conn' => 'utf8',
    'column_latin1' => 'abc äŒé',
    'column_utf8' => 'abc äŒé ♔♕',
    'column_utf8mb4' => 'abc äŒé ♔♕ ???? ????',
  ),
  1 => 
  array (
    'conn' => 'utf8mb4',
    'column_latin1' => 'abc äŒé',
    'column_utf8' => 'abc äŒé ♔♕',
    'column_utf8mb4' => 'abc äŒé ♔♕  ',
  ),
)

As you can see, we can not use utf8 as connection charset, when we work with utf8mb4 columns (see ????). But we can use utf8mb4 for connection when working with utf8 columns. Also neither has problems writing to and reading from latin or ascii columns.

The reason is that you can encode any utf8, latin or ascii character in utf8mb4 but not the other way around. So using utf8mb4 as character set for connection is safe in this case.

Paul Spiegel
  • 30,925
  • 5
  • 44
  • 53
  • 1
    Thanks Paul, I found out the same by doing it. But since I must evaluate this for an enterprise project, I cannot rely on tests: as I stated in the bounty, I'm "Looking for an answer drawing from credible and/or official sources.", where "official" is the keyword. – Dr. Gianluigi Zane Zanettini Mar 28 '19 at 13:40
  • @Dr.GianluigiZaneZanettini - What is more credible than a reproducible test? You can even run a test with all `utf8` (3-byte) characters. – Paul Spiegel Mar 28 '19 at 13:49
  • What is more credible is official documentation stating this. The problem with "let's try and see what happen" is that you could always miss some test-case. – Dr. Gianluigi Zane Zanettini Apr 01 '19 at 08:11
  • 2
    @Dr.GianluigiZaneZanettini - I doubt you will ever find a statement like "*It is safe to use character set X for connection and character set Y for stored data, if you only use characters from Y*". They would need to write it for every pair of character sets. You will not even find a statement, which tells you that you can safely use utf8mb4 connection for utf8mb4 data. (No - it's not a typo - I mean the same charset) You need to interpret the docs. My interpretation is the last sentence of my answer. Read [Chapter 10](https://dev.mysql.com/doc/refman/8.0/en/charset.html) - entirely. – Paul Spiegel Apr 01 '19 at 12:05
3

Short Answer: Yes, if you are only using 3-byte (or shorter) UTF-8 characters.

Or... No if you intend to work with 4-byte UTF-8 characters such as .

Long Answer:

(And I will address why "no" could be the right answer.)

The connection establishes what encoding the client is using.

The CHARACTER SET on a column (or, by default, from the table) establishes what encoding can be put into the column.

CHARACTER SET utf8 is a subset of utf8mb4. That is, all characters acceptable to utf8 (via connection or column) are acceptable to utf8mb4. Phrased another way, MySQL's utf8mb4 (same as the outside world's UTF-8) have the full 4-byte utf-8 encoding that includes more Emoji, more Chinese, etc, than MySQL up-to-3-byte utf8 (aka "BMP")

(Technically, utf8mb4 only handles up to 4 bytes, but UTF-8 handles longer characters. However, I doubt if 5-byte characters will happen in my lifetime.)

So, here's what happens with any 3-byte (or shorter) UTF-8 character in the client, given that the Connection is utf8mb4 and the columns in the tables are only utf8: Every character goes into and comes out of the server without transformation and without errors. Note: The problem occurs on INSERT, not on SELECT; however you may not notice the problem until you do a SELECT.

But, what if you have an Emoji in the client? Now you will get an error. (Or a truncated string) (Or question mark(s)) This is because the 4-byte Emoji (eg, ) cannot be squeezed into the 3-byte "utf8" (or "1-byte latin1" or ...).

If you are running 5.5 or 5.6, you may run into the 767 (or 191) problem. I provide several workarounds in here. None is perfect.

As for inverting (utf8 connection but utf8mb4 columns): The SELECT can have trouble if you manage to get some 4-byte characters into the table.

"Official sources" -- Good luck. I have spent a decade trying to tease out the ins and outs of character handling, and then simplify it into actionable sentences. Much of that time was thinking I had all the answers, only to encounter yet another failing test case. The common cases are listed in Trouble with UTF-8 characters; what I see is not what I stored . However, that does not directly address your question!

From Comment

mysql> SHOW CREATE TABLE emoji\G
*************************** 1. row ***************************
       Table: emoji
Create Table: CREATE TABLE `emoji` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `text` varchar(255) NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=5 DEFAULT CHARSET=utf8mb4
1 row in set (0.00 sec)

mysql> insert into emoji (text) values ("abc");
Query OK, 1 row affected (0.01 sec)

mysql> show variables like 'char%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       |
| character_set_connection | utf8                       |
| character_set_database   | utf8mb4                    |
| character_set_filesystem | binary                     |
| character_set_results    | utf8                       |
| character_set_server     | utf8mb4                    |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)

The above says that the "connection" (think "client") is using utf8, not utf8mb4.

mysql> insert into emoji (text) values ("");  -- 4-byte Emoji
Query OK, 1 row affected, 1 warning (0.00 sec)

mysql> show warnings;
+---------+------+----------------------------------------------------------------------------------+
| Level   | Code | Message                                                                          |
+---------+------+----------------------------------------------------------------------------------+
| Warning | 1366 | Incorrect string value: '\xF0\x9F\x98\x85\xF0\x9F...' for column 'text' at row 1 |
+---------+------+----------------------------------------------------------------------------------+
1 row in set (0.00 sec)

Now, change the 'connection' to utf8mb4:

mysql> SET NAMES utf8mb4;
Query OK, 0 rows affected (0.00 sec)

mysql> insert into emoji (text) values ("");
Query OK, 1 row affected (0.01 sec)

mysql> SELECT * FROM emoji;
+----+--------------+
| id | text         |
+----+--------------+
|  1 | ? ? ? ?      |
|  2 | abc          |
|  3 | ???????????? |   -- from when "utf8" was in use
|  4 |              |  -- Success with utf8mb4 in use
+----+--------------+
4 rows in set (0.01 sec)
Rick James
  • 135,179
  • 13
  • 127
  • 222
  • 1
    Thanks for this! Just a clarification: "given that the Connection is utf8mb4 and the columns in the tables are only utf8" <= I fully agree that with this configration you still cannot store 4-byte chars in MySQL (you would need utf8mb4 on the table too). But this would happend with utf8 (non-mb4) connection too, right? So nothing changes here... – Dr. Gianluigi Zane Zanettini Apr 01 '19 at 08:09
  • @Dr.GianluigiZaneZanettini - The connection parameter says what the encoding of the _client_ is. If it is wrong, various errors can occur (see that other bug I linked to). Having a 4-byte Emoji in the client is one form of "wrong" if the connection specifies that you have only 3-byte "utf8" in the client. – Rick James Apr 01 '19 at 16:57
  • @Dr.GianluigiZaneZanettini - Perhaps what I just added to my Answer directly addresses your comment. – Rick James Apr 01 '19 at 17:04
0

Short answer: NO, it is not safe.

If your data has utf8mb4 characters and you are using a MySQL utf8 charset connection, you will run into problems since MySQL utf8 charset supports only BMP characters (up to 3 bytes characters).

My recommendation is to convert all tables to utf8mb4 for full UTF-8 support. Also, utf8mb4 is backwards compatible with utf8.

randrade86
  • 346
  • 1
  • 10
  • 1
    Backwards compatible but with some issues: the maximum indexable length with InnoDB falls from 255 to 191 characters, and fixed-lenght fields (ex. CHAR) use one more byte per character – the_nuts Jan 04 '16 at 22:44
  • @the_nuts Well pointed. There is more info [here](https://dev.mysql.com/doc/refman/5.7/en/charset-unicode-upgrading.html) – randrade86 Jan 04 '16 at 23:10
  • 7
    You answered the inverted question. No, it is not safe to use utf8 connection with utf8mb4 columns, but shouldn't it work fine the other way around, which was what OP asked? – nickdnk Jul 11 '18 at 15:07
  • @nickdnk the funny thing is that the OP has accepted the answer – Accountant م Mar 29 '19 at 02:25
  • 1
    @Accountantم you're right, I accepted it because it was the only one, but it doesn't answer the question – the_nuts Mar 29 '19 at 13:01
  • 1
    actually I accepted it but at the end I'm using a utf8mb4 connection that handles some utf8 and latin columns, without any issue – the_nuts Mar 29 '19 at 13:13
  • It is possible to move the "Accepted" flag to another answer. If there is a 'better' Answer, that would help other readers. – Rick James Mar 29 '19 at 22:19