As I am sure a lot of people here are aware, having to deal with German umlauts and UTF8 collations can be problematic to say the least. Stuff like a = ä
, o = ö
, u = ü
is not only capable of affecting the sort order of the results but the actual results as well. Here is an example which clearly demonstrates how things can go wrong by simply trying to make a distinction between a singular and plural version of a noun (Bademantel
- singular, Bademäntel
- plural).
CREATE TABLE keywords (
id INT (11) PRIMARY KEY AUTO_INCREMENT,
keyword VARCHAR (255) NOT NULL
) ENGINE = MyISAM DEFAULT CHARACTER
SET = utf8 COLLATE = utf8_unicode_ci;
INSERT INTO keywords (keyword) VALUES ('Bademantel'), ('Bademäntel');
SELECT * FROM keywords WHERE keyword LIKE ('%Bademäntel%');
Results should be
+----+------------+
| id | keyword |
+----+------------+
| 1 | Bademäntel |
+----+------------+
yet with utf8_unicode_ci
the output is
+----+------------+
| id | keyword |
+----+------------+
| 1 | Bademantel |
| 2 | Bademäntel |
+----+------------+
which is clearly not the required result.
The actual problem is tied for my current project. It involves writing a keyword parser which is basically supposed to replace every occurrence of a keyword on the website with a link to the appropriate product page. In order to avoid unnecessary waste of resources only distinct keywords are fetched but using either
SELECT keyword FROM keywords GROUP BY keyword ORDER BY LENGTH(keyword) DESC
or
SELECT DISTINCT keyword FROM keywords ORDER BY LENGTH(keyword) DESC
will result in failing to process (link) all the non-umlaut versions of the words simply because they are not fetched during the query (i.e. all the keywords containing Bademäntel
will be fetched but Bademantel
will be omitted).
Now I realize that I have a couple of options to resolve this problem.
1) Use utf8_swedish_ci
for the keywords table or during the queries which would effectively save me from having to modify a lot of existing code.
SELECT DISTINCT keyword COLLATE utf8_swedish_ci AS keyword FROM keywords ORDER BY LENGTH(keyword) DESC;
Unfortunately I am not that reluctant to abandon utf8_unicode_ci
because a) it offers a really nice feature of sorting "Eszett" (ss
and ß
are considered the same), b) somehow it simply feels wrong to use a Swedish collation to handle German related stuff.
2) Modify the existing code to make use of utf8_bin
.
SELECT DISTINCT keyword COLLATE utf8_bin AS keyword FROM keywords ORDER BY LENGTH(keyword) DESC;
This works as intended but it has a nasty drawback that all comparison is case-sensitive which means that if I decided to rely on utf8_bin
as a solution for the problem I would have a hard time doing case-insensitive queries like LIKE('%Mäntel%')
which would most definitely omit records like Bademäntel
.
I know that this question pops every now and then on SO but some of the answers are now pretty old and I just want to know if there is some other solution that might have emerged in the meantime. I mean, I really can't get around the thought that a simple collation is allowed to completely change the results of a query. Sorting order yes, but the results itself?
Sorry for a bit longer post and thanks in advance for any kind of advice or comment.