0

I have created an autocomplete that matches against a list of names in a database.

The database that I'm working contains a ton of names with special characters, but the end users are most likely going to search with the English equivalent of those names, e.g. Bela Bartok for Béla Bartók and Dvorak for Dvořák, etc. Currently, doing the English searches returns no results.

I have come across threads saying that the way to solve this is to change your MySQL collation to utf8 (which I have done to no avail).

I think that this may be because I used utf8_unicode_ci, but the one that would get the results that I want is utf8_general_ci. The problem with the latter though is that all the comments say to no longer use it.

Does anyone know how I can solve this problem?

Francesco Casula
  • 26,184
  • 15
  • 132
  • 131
Luke Madhanga
  • 6,871
  • 2
  • 43
  • 47
  • 2
    Different characters stay different characters. There is no such thing as "special characters". What should be special about them? That they are less known to some people? – arkascha Aug 13 '15 at 12:19
  • @arkascha if they're not in the ASCII that the majority of the world uses, some may argue that it is a "special character". E.g. during the world cup, everyone's favourite German Özil had his name spelled by Oezil by the BBC because that's how it is transliterated – Luke Madhanga Aug 13 '15 at 13:04
  • @arkascha when you Google search ozil you get Özil as a result – Luke Madhanga Aug 13 '15 at 13:05
  • That is true, but that does not mean that the characters are identical. Nor does it explain what should be "special" about the umlaut "Ö". The solution indeed is to use a suitable collation, since a collations defines a "mapping" of characters into others. It does _not_ make or treat characters identically. That would have pretty funny effects. – arkascha Aug 13 '15 at 13:21

6 Answers6

0

If you know the list of special characters and what the equivalents in plain English are, than you can do the following:

  1. lower case the string
  2. replace the characters with the lower case equivalents
  3. search against that "plain English" column

You will need to use the full text searching of MySQL in order to search against the text or come up with a home grown solution for how you're going to handle that.

Hatem Jaber
  • 2,341
  • 2
  • 22
  • 38
0

Just tested with both utf8_general_ci and utf8_unicode_ci collations and it worked like a charm in both cases.

Follows the MySQL code I used to run my test:

CREATE TABLE `test` (
 `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
 `text` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
 PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

INSERT INTO `test` (`id`, `text`) VALUES (NULL, 'Dvořák'), (NULL, 'Béla Bartók');

SELECT * FROM `test` WHERE `text` LIKE '%dvorak%';

The above SELECT statement returns:

id      text
--------------
1       Dvořák

Note: During my test I set all the collations to the desired one. The database collation, the table collation and the column collation as well.

Could it be that there's a bug in your PHP application?

Francesco Casula
  • 26,184
  • 15
  • 132
  • 131
  • If it worked for you then there may well be a bug in the PHP. I just saw your edit about column collation which I have not set. I'll test that. Thanks – Luke Madhanga Aug 13 '15 at 13:00
  • I suggest you try my MySQL snippet in your environment without passing through PHP. Just the MySQL via shell. If it works like it worked with me (and it will!) then you can exclude is MySQL issue :-) – Francesco Casula Aug 13 '15 at 13:02
  • You're most correct. The above worked perfectly fine outside of PHP, my query too. However, I just tested something else: as I needed to have a word boundary, I used regexp instead of like, which does **not** work in this case! – Luke Madhanga Aug 13 '15 at 13:15
0

I found the solution to my problem. Changing the collation to utf8_unicode_ci works perfectly fine. My problem was that I needed to use REGEXP in my query instead of LIKE, but REGEXP obviously doesn't work in this situation!

So in short, changing your collation to utf8_unicode_ci will allow you to compare Dvorak and Dvořák using = or LIKE, but not one of the REGEXP equivalents.

Luke Madhanga
  • 6,871
  • 2
  • 43
  • 47
  • This doesn't look like a solution to me. You're just saying that your code wasn't working because you were using `REGEXP` instead of `LIKE`. So what's the solution? How would it possible to achieve the same goal with `REGEXP`? If it's not possible you should write that in your answer perhaps pointing to some official documentation. – Francesco Casula Aug 13 '15 at 14:25
  • @francescoCasula yeah it was a code error on my part – Luke Madhanga Aug 13 '15 at 14:27
0

First, let's see if the data is stored correctly. Do

SELECT name, HEX(name) FROM ... WHERE ...;

Béla may come out (ignoring the spaces)

42 C3A9 6C 61 -- if correctly encoded with utf8 (é = C3A9)
42  E9  6C 61 -- if encoded with latin1 (é = E9)

The "Collation" (utf8_general_ci or utf8_unicode_ci) makes no difference for the examples you gave. Both tread é = e. See extensive list of equivalences for utf8 collations.

After you determine the encoding, we can proceed to prescribe a cure.

Rick James
  • 135,179
  • 13
  • 127
  • 222
0

Taking a hint from Rick James, using:

SELECT * FROM `test` WHERE HEX(`column`) = HEX('Dvořák');

Should work. If you need a case insensitive query, then you'll need to lower/upper both sides in addition to the HEX check.

0

A more up to date collation is utf8mb4_unicode_520_ci.

Note, it does NOT work for utf8mb4_unicode_ci. See the comparison here: https://stackoverflow.com/a/59805600/857113

Frank Forte
  • 2,031
  • 20
  • 19