2

Edit: if you're here because you're confused by the polish collation in MySQL, read this.

I'm trying to perform a full-text search on a table of polish cities and many of them contain accented characters. It's meant to be used in an ajax call for auto completion so it would be nice if the search was accent-insensitive. I've set the collation of the rows to ut8_polish_ci. Now, given the city "Zelów", I query the database like this

SELECT * FROMcitiesWHERE MATCH( city ) AGAINST ("zelow")

but to no avail. Mysql returns an empty result. I've tried different accents, tried adding different collations to the query but nothing helped. I'm not sure how I should approach this because accent-sensitivity seems to be poorly documented. Any ideas?


EDIT

So I found out that the case-insensitive full-text searches are performed only IN BOOLEAN MODE, so the correct query would be

SELECT * FROMcitiesWHERE MATCH( city ) AGAINST ( "zelow" IN BOOLEAN MODE )

Previously I thought otherwise due to a misleading comment on dev.mysql.com. There might be more to it but I'm just really confused right now.

Anyway, as mentioned in the comments below, I have UNIQUE index on the cities column so changing the collation of the table to accent-insensitive utf8_general_ci is out of the question.

I realized however, that the following query works quite well on a table with utf8_polish_ci collation:

SELECT * FROMcitiesWHERE city LIKE 'zelow' COLLATE utf8_general_ci

It would seem now that the most reasonable solution would be to do a full-text search in a similar fashion:

SELECT * FROMcitiesWHERE MATCH( city ) AGAINST ( 'zelow' IN BOOLEAN MODE ) COLLATE utf8_genral_ci

This however yields the following error:

#1253 - COLLATION 'utf8_general_ci' is not valid for CHARACTER SET 'binary'

This is really starting to get on my nerves. Might as well abandon full-text search in favour of a simple where-like approach but it doesn't seem sensible in a table with almost 50k records which will be intensively queried...


LAST EDIT Ok, the thing with boolean mode was partly bullshit. Only partly because it indeed works as I said, however, on a utf8_general_ci it works the other way around. I'm utterly perplexed and have no will to study this issue further. I decided to drop the UNIQUE index (no further cities will be added anyway so no need to make a big deal out of it) and stick with the utf8_general_ci table collation. I appreciate all the help, it steered me in the right direction.
Marcin Wasilewski
  • 685
  • 1
  • 10
  • 26

4 Answers4

3

Change your collation to utf_general_ci. It ignores accent when searching and ordering but still stores them correctly.

brian d foy
  • 129,424
  • 31
  • 207
  • 592
dee-see
  • 23,668
  • 5
  • 58
  • 91
  • The problem is, the column which stores city names has a unique key. What about cities like Ostrowik and Ostrówik which are completely different? If I change the collation to utf8_general_ci, mysql complains that there's a duplicate entry on this column. So I'd like to keep current collation in the table but only perform a accent-insensitive search. I tried adding "COLLATE utf8_general_ci" at the end of the query, but mysql says "#1253 - COLLATION 'utf8_general_ci' is not valid for CHARACTER SET 'binary'" – Marcin Wasilewski Aug 19 '11 at 21:11
  • See if [this question](http://stackoverflow.com/questions/901066/mysql-case-sensitive-search-for-utf8-bin-field) can help you. – dee-see Aug 19 '11 at 21:22
  • Thanks but I've already googled that :(. See the original post for the update. – Marcin Wasilewski Aug 19 '11 at 22:07
2

MySQL is very flexible in the encoding/collation area, maybe too flexible. When changing your encoding/collation, make sure you are converting the table, not just changing the encoding/collation types.

ALTER TABLE tablename CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;

You can also convert individual fields, so your table can have a collation setting of utf8_general_ci, but you can change one or more fields so they use some other collation. Base on the "binary" error you are seeing, it seems your text field might have a collation of UTF8-BIN (or be a blob). Can you post the result of CREATE TABLE?

Remember, the CHARACTER SET (encoding) is how the data is stored, the collation is how it is indexed. Not all combinations work.

My original problem, and question, might help a little: Converting mysql tables from latin1 to utf8

Community
  • 1
  • 1
Brent Baisley
  • 12,641
  • 2
  • 26
  • 39
  • We are talking about slightly different things. I want my table to be utf8_polish_ci to assure correct storing and sorting but still be able perform a full-text search with a different collation. The character set of the table IS utf8. It is not possible which might be either a mysql bug or a feature I failed to understand. You are welcome to try it yourself: CREATE TABLE `test` (`name` VARCHAR( 10 ) NOT NULL) ENGINE = MYISAM CHARACTER SET utf8 COLLATE utf8_polish_ci; and try to perform a full-text search SELECT * FROM `test` WHERE MATCH(`name`) AGAINST (`whatever`) COLLATE utf8_general_ci; – Marcin Wasilewski Aug 20 '11 at 13:41
  • What you probably want to do is change the field to be utf8_polish_ci. So the table would be utf8_general_ci, but the field would be a different collation. You can do an alter table and just change the collation for the one field. Although UTF8 generally handles accented characters in the way you want. – Brent Baisley Aug 21 '11 at 04:18
0

Change your collation from binary to utf8_bin. utf8_bin should be compatible with utf8_general_ci, but will still allow you to store city names with differing accents.

Stephen Ostermiller
  • 23,933
  • 14
  • 88
  • 109
0

If you try :

select * from cities where cityname like 'zelow'
Cygnusx1
  • 5,329
  • 2
  • 27
  • 39
  • 1
    Do as Vache said... i checked my database and the collation was utf_general_ci. and everything worked as expected – Cygnusx1 Aug 19 '11 at 19:54