1

In a PHP script I am doing the (hardcoded) MySQL-Request:

SELECT text FROM `mytexts` WHERE text LIKE '%ö%'

which wrongly returns the text "willkommen" (but not the text "willkammen") as well.

I have tried to have everything set correctly for UTF8:

  • The source code file is in "UTF8 without BOM".

  • Database, tables and fields (except the numeric fields) are collated as "utf8_unicode_ci".

  • The connection is done by $server = mysqli_connect($dbhostname,$dbuser,$dbpass,$dbname); followed by mysqli_set_charset($server, "utf8");

  • The data ("willkommen" and "willkammen") has been entered by a hardcoded INSERT

What do I have to do to get my SELECT work correctly?

Wooz
  • 306
  • 1
  • 2
  • 13
  • 1
    This might help you http://stackoverflow.com/questions/279170/utf-8-all-the-way-through/279279#279279 (using utf8mb4 for full UTF8 support) – Eric Martinez Jul 03 '15 at 17:49
  • This might help as well http://stackoverflow.com/a/6226763/3044080 – nomistic Jul 03 '15 at 17:50
  • After reading the two articles above I tried the following: 1.) set database field collation to "utf8mb4_unicode_ci" 2.) changed the code to "mysqli_set_charset($server, "utf8mb4");" 3.) re-inserted the text "willkommen" hardcoded 4.) re- tried 1-3 with additional "mysqli_query($server,"SET NAMES 'utf8mb4'");" Unfortunally this did not help. I will reread the two articles more deeply, but maybe you have a clue which point is the most promising... – Wooz Jul 03 '15 at 18:41
  • As I learned below my understanding of collations was limited. So it seems this wasn't a bug but a feature. Or at least works as designed... – Wooz Jul 04 '15 at 08:55

2 Answers2

2

You probably have to set another collation, one that doesn't treat ö==o but is still case insensitive.

SHOW COLLATION

shows you all available collations.
(_utf8 x'c3b6' is the character ö as unicode hex literal:)

select _utf8 x'c3b6' COLLATE utf8_general_ci LIKE 'o'

-> 1. This collation makes no difference between Umlauten and their corresponding base vowel counterparts.

select _utf8 x'c3b6' COLLATE utf8_bin LIKE 'o'

-> 0. Ok, but this one's not case insensitive.

select _utf8 x'c3b6' COLLATE utf8_german2_ci LIKE 'o'

-> 0. This one is my best guess.

VolkerK
  • 95,432
  • 20
  • 163
  • 226
  • I am really sorry, but I fear I have not yet understood this. Does it mean I have to mention the collation in every request? And does it then work for "ö" only? – Wooz Jul 03 '15 at 18:27
  • But OP does not want it to be equivalent, that's the point. err: I think you wrote '1' for the last use case but it is actually '0' what you meant (because that query returns 0, and that's what OP wants) – Sebas Jul 03 '15 at 18:34
  • @ Sebas doh, you're right. Stupid typo. The query returns 0 - which makes the collation feasible. – VolkerK Jul 03 '15 at 19:47
  • @Wooz , you could ...but you don't have to provide it with every query. see https://dev.mysql.com/doc/refman/5.0/en/charset-table.html – VolkerK Jul 03 '15 at 19:54
  • Mmm, I think I start realizing my big misunderstanding.. so there simply isn't such a thing as a "general" utf8 - and there is no collation to represent at the same time all languages (at least those mainly based on latin letters) ?! – Wooz Jul 03 '15 at 21:00
  • There is _general_ci, but it never considers things like ij or ae or oe or ss; _unicode_ci does take those letter-pairs into account (eg, `ae` = `æ`). It turns out that it you make one language happy, you make another uhappy. Furthermore, some languages have more than one way collate. Example: spanish2_ci has `ch` coming after `cz` but spanish_ci does not. – Rick James Jul 04 '15 at 04:31
1

Here are some of the equavalences for utf8 (or utf8mb4):

utf8 : utf8_croatian_ci            O=o=º=Ò=Ó=Ó=Ô=Õ=Ö=ò=ó=ó=ô=õ=ö      oe=Œ=œ      oz           Ø=ø
utf8 : utf8_czech_ci               O=o=º=Ò=Ó=Ó=Ô=Õ=Ö=ò=ó=ó=ô=õ=ö      oe=Œ=œ      oz           Ø=ø
utf8 : utf8_danish_ci              O=o=º=Ò=Ó=Ó=Ô=Õ=ò=ó=ó=ô=õ          oe=Œ=œ      oz
utf8 : utf8_esperanto_ci           O=o=º=Ò=Ó=Ó=Ô=Õ=Ö=ò=ó=ó=ô=õ=ö      oe=Œ=œ      oz           Ø=ø
utf8 : utf8_estonian_ci            O=o=º=Ò=Ó=Ó=Ô=ò=ó=ó=ô              oe=Œ=œ      oz           Ø=ø
utf8 : utf8_general_ci             O=o=Ò=Ó=Ó=Ô=Õ=Ö=ò=ó=ó=ô=õ=ö        oe          oz
utf8 : utf8_general_mysql500_ci    O=o=Ò=Ó=Ó=Ô=Õ=Ö=ò=ó=ó=ô=õ=ö        oe          oz
utf8 : utf8_german2_ci             O=o=º=Ò=Ó=Ó=Ô=Õ=ò=ó=ó=ô=õ          oe=Ö=ö=Œ=œ  oz           Ø=ø
utf8 : utf8_hungarian_ci           O=o=º=Ò=Ó=Ó=Ô=Õ=ò=ó=ó=ô=õ          oe=Œ=œ      oz  Ö=ö      Ø=ø
utf8 : utf8_icelandic_ci           O=o=º=Ò=Ô=Õ=ò=ô=õ                  oe=Œ=œ      oz  Ó=Ó=ó=ó
utf8 : utf8_latvian_ci             O=o=º=Ò=Ó=Ó=Ô=Õ=Ö=ò=ó=ó=ô=õ=ö      oe=Œ=œ      oz           Ø=ø
utf8 : utf8_lithuanian_ci          O=o=º=Ò=Ó=Ó=Ô=Õ=Ö=ò=ó=ó=ô=õ=ö      oe=Œ=œ      oz           Ø=ø
utf8 : utf8_persian_ci             O=o=º=Ò=Ó=Ó=Ô=Õ=Ö=ò=ó=ó=ô=õ=ö      oe=Œ=œ      oz           Ø=ø
utf8 : utf8_polish_ci              O=o=º=Ò=Ô=Õ=Ö=ò=ô=õ=ö              oe=Œ=œ      oz  Ó=Ó=ó=ó  Ø=ø
utf8 : utf8_roman_ci               O=o=º=Ò=Ó=Ó=Ô=Õ=Ö=ò=ó=ó=ô=õ=ö      oe=Œ=œ      oz           Ø=ø
utf8 : utf8_romanian_ci            O=o=º=Ò=Ó=Ó=Ô=Õ=Ö=ò=ó=ó=ô=õ=ö      oe=Œ=œ      oz           Ø=ø
utf8 : utf8_sinhala_ci             O=o=º=Ò=Ó=Ó=Ô=Õ=Ö=ò=ó=ó=ô=õ=ö      oe=Œ=œ      oz           Ø=ø
utf8 : utf8_slovak_ci              O=o=º=Ò=Ó=Ó=Õ=Ö=ò=ó=ó=õ=ö          oe=Œ=œ      oz  Ô=ô      Ø=ø
utf8 : utf8_slovenian_ci           O=o=º=Ò=Ó=Ó=Ô=Õ=Ö=ò=ó=ó=ô=õ=ö      oe=Œ=œ      oz           Ø=ø
utf8 : utf8_spanish2_ci            O=o=º=Ò=Ó=Ó=Ô=Õ=Ö=ò=ó=ó=ô=õ=ö      oe=Œ=œ      oz           Ø=ø
utf8 : utf8_spanish_ci             O=o=º=Ò=Ó=Ó=Ô=Õ=Ö=ò=ó=ó=ô=õ=ö      oe=Œ=œ      oz           Ø=ø
utf8 : utf8_swedish_ci             O=o=º=Ò=Ó=Ó=Ô=Õ=ò=ó=ó=ô=õ          oe=Œ=œ      oz
utf8 : utf8_turkish_ci             O=o=º=Ò=Ó=Ó=Ô=Õ=ò=ó=ó=ô=õ          oe=Œ=œ      oz  Ö=ö      Ø=ø
utf8 : utf8_unicode_520_ci         O=o=º=Ò=Ó=Ó=Ô=Õ=Ö=Ø=ò=ó=ó=ô=õ=ö=ø  oe=Œ=œ      oz
utf8 : utf8_unicode_ci             O=o=º=Ò=Ó=Ó=Ô=Õ=Ö=ò=ó=ó=ô=õ=ö      oe=Œ=œ      oz           Ø=ø
utf8 : utf8_vietnamese_ci          O=o=º=Ò=Ó=Ó=Õ=Ö=ò=ó=ó=õ=ö          oe=Œ=œ      oz  Ô=ô      Ø=ø

It says that, for most collations, ö is treated identical to o. However, hungarian_ci and turkish_ci sort ö as if it were a letter between o and p. And german2_ci treats it equal to oe and œ. For danish_ci, icelandic_ci and swedish_ci, it comes after z.

More collations details

Rick James
  • 135,179
  • 13
  • 127
  • 222
  • This visualizes it greatly. Seems I was quite naive and it as you said above "if you make one language happy, you make another uhappy...". At least this table allows picking the least worst collation - even though I am still puzzled by the fact letters are treated identically to others.. – Wooz Jul 04 '15 at 08:51
  • Many people have grumbled that there is no utf8 collation that separates case-sensitivity from accent-sensitivity. – Rick James Jul 04 '15 at 15:15