0

I am facing some weird collation issue which I really don't understand. I have table containing polish accented letters like "Ł" in two or three columns. Due to occurrence of this letters, utf8_unicode_520_ci collation is most suitable for my purpose. I am facing this issue in my PHP page. Webserver is running Apache 2.4, PHP 7.1.0, MySQL 5.7.

After connecting to database, I am setting charset and collation this way:

$connection->set_charset("utf8");
$connection->query("SET NAMES utf8 COLLATE utf8_unicode_520_ci;");

Afterwards I am double-checking actual charsets and collation using SHOW VARIABLES LIKE '%collat%'; and SHOW VARIABLES LIKE '%char%'; which in formatted form return following information:

collation_connection->utf8_unicode_520_ci
collation_database->utf8_unicode_520_ci
collation_server->binary
character_set_client->utf8
character_set_connection->utf8
character_set_database->utf8
character_set_filesystem->binary
character_set_results->utf8
character_set_server->binary
character_set_system->utf8
character_sets_dir->/usr/share/mysql/charsets/

Until this point everything seems to be corrrect. But i am facing ordering issue when I am running query below:

SELECT birthPlace FROM persons ORDER BY birthPlace DESC LIMIT 5;

which retrun following results:

Łapsze Wyżne
Łapsze Wyżne
Zvolenská Slatina
Župčany
Župčany

Entry "Łapsze Wyżne" should not be listed as last as per utf8_unicode_520_ci collation it should be listed along with entries starting with normal "L" character. So it is evident that it is picking other collation then utf8_unicode_520_ci.I think is utf8_general_ci collation what is used (guessing from results order). Collation of table and column "birthPlace" is utf8_unicode_520_ci. I've expected collation precedence will be default column, table, database respectively.

phpMyAdmin and MySQL Workbench sort entries as expected. But my script is not sorting as expected.

So my question is what collation is really used if it is not explicitly set in query and how to solve it without modifying query if possible.

EDIT: Question itself is not about choosing proper collation as I already know and defined collation which is correct one for my purpose on all places I am able to control. Question is what collation is used if it is not explixitly defined in SELECT query. It is collation of connection, column, table, database, or server? Or it is default collation of used charset (in my case it is utf8_general_ci for uth8 charset? This is not clear to me from documentation. I want to order results by utf8_unicode_520_ci by default without defining it in every query I'll run in future. It is possible?

Rick James
  • 135,179
  • 13
  • 127
  • 222

1 Answers1

0

Since this question seems to be more about Polish collation than about "best for PHP", I reopened it.

utf8_unicode_520_ci treats L=l=Ĺ=ĺ=Ļ=ļ=Ł=ł. Virtually all other utf8 collations treat L=l=Ĺ=ĺ=Ļ=ļ < Ł=ł. In particular, Ł is effectively a letter that comes after L and before M.

I suggest you consider utf8_polish_ci if Polish is the main content of the database. Be aware that grave accents on C,N,O,S,Z may (or may not) be treated the way you want them in this collation.

See this for a rundown of the ordering of letters in the various utf8 collations for MySQL.

The 'blame' for the treatment of Ł goes to Unicode in its version 5.20. Unicode 9.0 also works the same way.

The "best for PHP" (or any other general application) is utf8mb4_unicode_520_ci through MySQL 5.7. In MySQL 8.0, utf8mb4_0900_ai_ci is probably considered 'best'.

(Note: CHARACTER SET utf8 is good enough for most languages. utf8mb4 is a superset, with Emoji and the rest of Chinese added.)

Rick James
  • 135,179
  • 13
  • 127
  • 222