6

I'm running into an issue where I'm getting differently ordered results when querying with PHP Versus the command line. From my research, it appears that in some cases that bad encoding can cause problems with the order of the results.

That said, all my DB tables are encoded as utf8mb4, with the collation utf8mb4_general_ci. However, it doesnt seem that the mysql variables are set correctly.

I'm on Mysql 5.5.5-10.1.26-MariaDb.

Here are my CNF settings, but to be honest I don't know what I'm doing here:

[client]
default-character-set=utf8mb4

[mysql]
default-character-set=utf8mb4

[mariadb]


[mysqld]

character-set-server=utf8mb4
character_set_client=utf8mb4
collation-server=utf8mb4_general_ci

The variables output from mysql:

character_set_client        utf8
character_set_connection    utf8
character_set_database      utf8mb4
character_set_filesystem    binary
character_set_results       utf8
character_set_server        utf8mb4
character_set_system        utf8
collation_connection        utf8_general_ci
collation_database          utf8mb4_unicode_ci
collation_server            utf8mb4_general_ci

Update: A person has asked for how I'm connecting to the database:

$this->connection = new PDO('mysql:host='.DB_SERVER.';dbname='.DB_NAME.';port='.DB_PORT, DB_USER, DB_PASS, $options);

Update: I've switched to utf8mb4_unicode_ci (as per suggestions in answers below).

stwhite
  • 3,156
  • 4
  • 37
  • 70
  • Pls show us how you connect to mariadb from pup and which API you use. – Shadow Nov 30 '17 at 06:18
  • I'm not just using PHP, I also use other languages. So I'd rather not set this in PHP itself. None of which should have control over changing this... that said I've updated the question with that info. – stwhite Nov 30 '17 at 06:23
  • They do have a control over setting the character sets because different clients may use different character sets and collation. – Shadow Nov 30 '17 at 06:41
  • If I need to set it for each that's not a big problem. I just want to make sure I'm on the right track considering a mess up in encoding causes problems all the way down the line. – stwhite Nov 30 '17 at 06:43
  • 1
    See https://stackoverflow.com/questions/4361459/php-pdo-charset-set-names – Shadow Nov 30 '17 at 06:44

4 Answers4

8

You want to have character-set-client-handshake = FALSE as well.

With /etc/my.cnf.d/character-set.cnf

# https://scottlinux.com/2017/03/04/mysql-mariadb-set-character-set-and-collation-to-utf8/
# https://mariadb.com/kb/en/library/setting-character-sets-and-collations/
# https://medium.com/@adamhooper/in-mysql-never-use-utf8-use-utf8mb4-11761243e434
# https://stackoverflow.com/questions/47566730/force-mariadb-clients-to-use-utf8mb4

[client]
default-character-set = utf8mb4

[mysql]
default-character-set = utf8mb4

[mysqld]
character-set-client-handshake = FALSE
collation-server = utf8mb4_unicode_ci
init-connect = 'SET NAMES utf8mb4 COLLATE utf8mb4_unicode_ci'
character-set-server = utf8mb4

I get everything to be utf8mb41

MariaDB [(none)]> show variables like 'char%'; show variables like 'collation%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8mb4                    |
| character_set_connection | utf8mb4                    |
| character_set_database   | utf8mb4                    |
| character_set_filesystem | binary                     |
| character_set_results    | utf8mb4                    |
| character_set_server     | utf8mb4                    |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)

+----------------------+--------------------+
| Variable_name        | Value              |
+----------------------+--------------------+
| collation_connection | utf8mb4_unicode_ci |
| collation_database   | utf8mb4_unicode_ci |
| collation_server     | utf8mb4_unicode_ci |
+----------------------+--------------------+
3 rows in set (0.00 sec)

MariaDB [(none)]>

however without the character-set-client-handshake line some are still utf8

MariaDB [(none)]> show variables like 'char%'; show variables like 'collation%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       |
| character_set_connection | utf8                       |
| character_set_database   | utf8mb4                    |
| character_set_filesystem | binary                     |
| character_set_results    | utf8                       |
| character_set_server     | utf8mb4                    |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)

+----------------------+--------------------+
| Variable_name        | Value              |
+----------------------+--------------------+
| collation_connection | utf8_general_ci    |
| collation_database   | utf8mb4_unicode_ci |
| collation_server     | utf8mb4_unicode_ci |
+----------------------+--------------------+
3 rows in set (0.01 sec)

MariaDB [(none)]>

1 character_set_system is always utf8.

hlovdal
  • 26,565
  • 10
  • 94
  • 165
2

You should probably use utf8mb4_unicode_ci instead of utf8mb4_general_ci as it's more accurate. Unless you're running MariaDB on a system with an old/limited CPU and performance is a huge concern.

That being said, the solution is to set init_connect in your MariaDB configuration (or --init-connect on the command line):

init_connect = "SET NAMES utf8mb4 COLLATE utf8mb4_unicode_ci"

Either way is fine. I am not recommending one way over the other. Both are equally valid approaches.

Your MariaDB configuration may be in my.cnf or a file included by my.cnf, typically found under /etc/mysql. Check your system documentation for details. Because you are configuring a server variable, as indicated by the MariaDB documentation linked to above, you should set the variable in the server part of the configuration file. The server part of the configuration files is indicated by the INI section names ending in "d". An INI section is denoted by a keyword surrounded by square brackets, e.g. "[section]". The "d" stands for "daemon", which is standard UNIX nomenclature for a server process. You can set the variable in either the [mysqld] section or the [mariadb] section. Because the init_connect server variable is common to both MySQL and MariaDB, I would recommend you put it under [mysqld].

I see that you are setting character_set_client=utf8mb4 in your pasted configuration. You don't need to do this. You can delete or comment out the line. Comments are lines starting with pound symbol (#), also known as a hash mark, octothorp, or number sign.

Any and all clients that connect to the server will execute these command(s) before any other commands are processed.

mwp
  • 8,217
  • 20
  • 26
  • What's the reason for using utf8mb4_unicode_ci instead of utf8mb4_general_ci? It would help to put that in your answer. Also, are you suggesting to not using the my.cnf file and why? – stwhite Nov 30 '17 at 06:10
  • It's more accurate. I'll amend my answer to say that. And no... I'm not sure why you think I'm suggesting that. – mwp Nov 30 '17 at 06:11
  • I see now that you're saying place that line in my.cnf but does that line go with other lines? Do I need the other lines? – stwhite Nov 30 '17 at 06:13
  • Yes, under `[mysqld]` (or `[mariadb]`), and yes, you need them, but you don't need to set `character_set_client`. – mwp Nov 30 '17 at 06:15
  • This will not affect the connections made from php, they will override this setting. – Shadow Nov 30 '17 at 06:17
  • @mwp could you expand upon your answer with all the necessary parts to your answer? – stwhite Nov 30 '17 at 06:28
  • @stwhite As you wish. – mwp Nov 30 '17 at 06:44
  • Made those changes and `collation_connection` is still set to `utf8mb4_general_ci` after restarting mysql. – stwhite Nov 30 '17 at 07:03
  • @stwhite Wait, isn't that good? Or did you mean still set to `utf8_general_ci`? – mwp Nov 30 '17 at 07:06
  • I switched from `utf8_general_ci ` to `utf8mb4_unicode_ci` as per your suggestion, so `collation_connection` should be `utf8mb4_unicode_ci`. – stwhite Nov 30 '17 at 07:07
2

init_connect is not performed by anyone connecting as root, so it is not as universal as you would like.

SET NAMES utf8mb4 sets 3 things; experiment to see that. You need all 3.

If you weren't as far back as 5.5, I would recommend utf8mb4_unicode_520_ci as being a better collation: "Unicode collation names now may include a version number to indicate the Unicode Collation Algorithm (UCA) version on which the collation is based. Initial collations thus created use version UCA 5.2.0. For example, utf8_unicode_520_ci is based on UCA 5.2.0. UCA-based Unicode collation names that do not include a version number are based on version 4.0.0."

Version 8.0 has Unicode 9.0 standard.

Back to the question: There is no perfect solution; the user can override whatever you do -- either through ignorance or through malice.

You could police the tables created, but that won't keep them from connecting incorrectly. Or correctly, but with a different charset. It is valid to do SET NAMES latin1, then provide latin1-encode bytes. MySQL will convert as it stores/fetches.

But if they have utf8-encoded bytes, but say SET NAMES latin1, you get "double encoding". This "bug" destroys any chance of collating correctly, but is otherwise (usually) transparent. That is, stuff is messed up as it is stored, then un-messed up as it is fetched.

Rick James
  • 135,179
  • 13
  • 127
  • 222
1

To fix this warning you should edit

/etc/my.cnf (my.ini on Windows)

Simply add/set in the file

[client]
default-character-set=utf8mb4

[mysql]
default-character-set=utf8mb4

[mysqld]
collation-server=utf8mb4_unicode_ci
init-connect='SET NAMES utf8mb4'
character-set-server=utf8mb4
Community
  • 1
  • 1
Ryabchenko Alexander
  • 10,057
  • 7
  • 56
  • 88