2

i am having an issue with german umlaute ä ö ü ... when using match against in my sql query

when i search for "äpfel" i get also "apfel" when using utf8_general_ci but what i want is to just get "äpfel", now when i change to utf8_bin i get exactly that BUT its case sensitive. so Äpfel or Apfel will not find äpfel or apfel

i would need utf8_bin_ci which does actually not exist i assume?

using utf8_bin with match against and lower(field) or lcase(field) turns out with an error so....what can i do to solve this?

query is like;

select * FROM table where MATCH(field) AGAINST ('äpfel')

and what i need but does not work is

select * FROM table where MATCH(lcase(field)) AGAINST ('äpfel')

in utf8_bin

Janus
  • 41
  • 3

2 Answers2

1

Alas, you get a choice between comparing the bits (no case or accent folding) with ..._bin or both case-folding and accent stripping with all the ..._ci Collations.

This provides info what characters are equivalent or not. For example, in utf8_german2_ci, these are treated as equal:

    A=a=ª=À=Á=Â=Ã=Å=à=á=â=ã=å=Ā=ā=Ą=ą

But these come later when sorting:

    ae=Ä=Æ=ä=æ

Of note: _general_ci never treats a 2-letter pair as equal to a single character, hence ae and æ are necessarily unequal in utf8_general_ci.

Rick James
  • 135,179
  • 13
  • 127
  • 222
0

It does sound like you need utf8_bin_ci but it does not exists, only utf8_bin does.

If you want to add case insensitivity, you need to add LOWER() to both elements like:

mysql> SELECT LOWER('Äpfel') = LOWER('äpfel') COLLATE utf8_bin;
+----------------------------------------------------+
| LOWER('Äpfel') = LOWER('äpfel') COLLATE utf8_bin   |
+----------------------------------------------------+
|                                                  1 |
+----------------------------------------------------+
1 row in set (0.00 sec)

-> expected result


mysql> SELECT LOWER('Äpfel') = LOWER('apfel') COLLATE utf8_bin;
+---------------------------------------------------+
| LOWER('Äpfel') = LOWER('apfel') COLLATE utf8_bin  |
+---------------------------------------------------+
|                                                 0 |
+---------------------------------------------------+
1 row in set (0.00 sec)

-> expected result as well

Update: Also please have a look at this interesting similar answer: utf8_bin vs. utf_unicode_ci

Gab
  • 3,404
  • 1
  • 11
  • 22
  • 2
    `*_bin` means binary -- exact equivalence. `*_bin_ci` would be "exact equivalence, case-insensitive" which is a logical contradiction. Characters in binary collations are neither upper or lower case, since lettercase is a form of deriving equivalence, and no character is equivalent to any other character in a binary collation. – Michael - sqlbot Mar 13 '17 at 04:02