6

I'm trying to build a MySQL search query. LIKE would be perfectly fine, but the client wants inputs with "oe" find "ö", "ae" find "ä" and "ue" find "ü" as it's fairly common in Germany.
I tried using REGEXP after replacing every occurence of "oe" to (oe|ö) but REGEXP is strict and doesn't match (for instance) "é" to "e".
Is there a way to make LIKE match "oe|ue|ae" or maybe some other way I haven't though of?
Thanks,
thomas

thomas
  • 2,297
  • 2
  • 22
  • 23
  • 2
    Wow that was quick. Downvoted before I was finished ... :( – thomas Jul 28 '16 at 12:22
  • Probably consider investing in a search engine like Solr. – apokryfos Jul 28 '16 at 12:27
  • Tip: set utf8_general in mysql then `e` will find `è` http://stackoverflow.com/questions/2344118/utf-8-general-bin-unicode – JustOnUnderMillions Jul 28 '16 at 12:28
  • Not with Regexp @JustOnUnderMillions – thomas Jul 28 '16 at 12:32
  • I wouldn't use regexp. Mysql's regexp doesn't work well with multibyte characters. `The REGEXP and RLIKE operators work in byte-wise fashion, so they are not multibyte safe and may produce unexpected results with multibyte character sets.` http://dev.mysql.com/doc/refman/5.7/en/regexp.html – chris85 Jul 28 '16 at 12:35

2 Answers2

9

At Character Sets and Collations Supported by MySQL I can only spot two German collations:

  • latin1_german1_ci
  • latin1_german2_ci

It seems that latin1_german2_ci is the one you want, however it expects Latin1:

latin1_german2_ci (phone-book) rules:

  • Ä = AE
  • Ö = OE
  • Ü = UE
  • ß = ss

If your table/column is not already using it, you can force such collation in the query itself, e.g.:

mysql> SELECT _latin1'oe' collate latin1_german2_ci ='ö' AS are_equal;
+-----------+
| are_equal |
+-----------+
|         1 |
+-----------+
1 row in set (0.00 sec)

If your application is using Latin1 this should do the trick. Otherwise, I honestly have no idea :)

Disclaimer: I know nothing about German. There may be another language that uses similar rules.

Álvaro González
  • 142,137
  • 41
  • 261
  • 360
  • 1
    Thanks! This is really helpful. Unfortunately our application uses UTF8 ... – thomas Jul 28 '16 at 12:58
  • @thomas If you have the chance of testing this on UTF-8 please share your experience. As I said, I don't really know MySQL handles mixed encodings. – Álvaro González Jul 29 '16 at 06:09
  • 1
    I get an "illegal mix of collations" error. As you said, latin_german2_ci expects latin1 characters. – thomas Jul 29 '16 at 08:15
  • But this is really helpful for future projects. For now went for the ugly way of converting my search string and results via `REPLACE()`, something suggested here but later deleted (thanks to the unknown) – thomas Jul 29 '16 at 08:17
  • There's also [CONVERT()](https://dev.mysql.com/doc/refman/5.7/en/charset-convert.html) but I don't have experience with it. – Álvaro González Jul 29 '16 at 08:24
  • 1
    Thanks to your `CONVERT()` comment I *almost* had it working: `SELECT convert(_utf8'ausdruecklich' using latin1) COLLATE latin1_german2_ci = convert('ausdrücklich' using latin1) as equal` but then came this: http://stackoverflow.com/questions/7917367/mysql-why-are-collation-rules-ignored-by-like-operator-for-german-%C3%9F-character I guess I'm outta luck. Thanks again for your help! – thomas Jul 29 '16 at 09:00
0

If you are using utf8, COLLATE utf8_german2_ci is required. See collation chart (And, according to that chart, german2 is the only one to satisfy your needs.)

mysql> SELECT "oe" = "ö" COLLATE utf8_german2_ci;
+-------------------------------------+
| "oe" = "ö" COLLATE utf8_german2_ci  |
+-------------------------------------+
|                                   1 |
+-------------------------------------+

However, It is much more efficient to declare the column(s) to be COLLATE utf8_german2_ci than to use that clause in comparisons.

(If you are using utf8mb4, change the spelling accordingly.)

Rick James
  • 135,179
  • 13
  • 127
  • 222