22

I have a schema in "utf8 -- UTF-8 Unicode" as charset and a collation of "utf8_spanish_ci".

All the inside tables are InnoDB with same charset and collation as mentioned.

Here comes the problem:

with a query like

SELECT *
FROM people p
WHERE p.NAME LIKE '%jose%';

I get 83 result rows. I should have 84 results, because I know it.

Changing where for:

WHERE p.NAME LIKE '%JOSE%';

I get the exact same 83 rows. With combinations like JoSe, Jose, JOSe, etc. All the same 83 rows are reported.

The problem comes when accents play in game. If do:

WHERE p.NAME LIKE '%josé%';

I get no results. 0 rows.

But if I do:

WHERE p.NAME LIKE '%JOSÉ%';

I get just one resulting row, so 1 row. This is the only row which has accented "jose" and capitalized.

I've tried with josÉ, or JoSÉ or whatever combination I do, as long as the accented letter stays capitalized or not, as it really is stored in the database and it stills returning the only row. If I suddenly change "É" for "é" in whatever combination I do with the capitalization in JOSE, it returns no rows.

So conclusions:

  • Case insensitive if no latin characters plays in game.
  • Case sensitive if latin characters appears.
  • Accent sensitive, as if I search JOSE or jose, I only get 83 rows, instead of the 84 rows I need.

What I want?

  • To search "jose", "JOSE", "José", "JOSÉ", "JÒSE", "jöse", "JoSÈ", ... have to return the 84 rows I know that exists. I what to turn my searches to case insensitive and "latin" insensitive.

Solutions like COLLATION on LIKE doesn't work for me, don't know why...

What can I do?

EDIT:

If I do something like:

WHERE p.NAME LIKE '%jose%' COLLATE utf8_general_ci;

I get the error:

COLLATION 'utf8_general_ci' is not valid for CHARACTER SET 'latin1'

And I've changed all the possible collations on the columns too!

And if I do something like:

WHERE p.NAME LIKE _utf8 '%jose%' COLLATE utf8_general_ci;

The same 83 rows are reported, as if I've made nothing...

Jason Aller
  • 3,541
  • 28
  • 38
  • 38
Lightworker
  • 593
  • 1
  • 5
  • 18
  • Have you found a solution for this? I am struglling with the same problem in a database with the character 'ã' and 'Ã'. If I search for 'joao', it will retrieve 'joao' and 'Joao', meaning it finds words with different casing, but not words with accents. If I search for 'João', it will not retrieve records where the A is capitalized as 'JOÃO'. I have even tried creating a table with 4 fields with 4 different collations (ut8_general_ci,utf8_unicode_ci,utf8_bin and latin1_general_ci) to insert the same items and try to search using each of the columns, but haven't found a solution... – Miguel Mesquita Alfaiate Jul 19 '13 at 15:06

2 Answers2

11

You have already tried to use an accent-insensitive collation for your search and ordering.

http://dev.mysql.com/doc/refman/5.0/en/charset-collation-implementations.html

The thing is, your NAME column seems to be stored in the latin1 (8-bit) character set. That's why mySQL is grumbling at you like this:

  COLLATION 'utf8_general_ci' is not valid for CHARACTER SET 'latin1'

You may get the results you want if you try

 WHERE CONVERT(p.NAME USING utf8) LIKE _utf8 '%jose%' COLLATE utf8_general_ci;

But, be careful!

When you use any kind of function (in this example, CONVERT) on the column in a WHERE statement, you defeat MySQL's attempts to optimize your search with indexes. If this project is going to get large (that is, if you will have lots of rows in your tables) you need to store your data in utf8 format, not latin1. (You probably already know that your LIKE '%whatever%' search term also defeats MySQL's indexing.)

O. Jones
  • 103,626
  • 17
  • 118
  • 172
  • Thanks, but it doesn't work. My schema right now is in utf8_general_ci. The connection is in UTF8 too. Tables are in utf8_general_ci, and all the columns from table are on the same charset and collation. Nothing in every point is declared as Latin1. Thing is, that I've tried all the possible combinations between charset and collation between conection, schema, tables and columns, and nothing changes. With your solution, if I write "jose"/"josé"/"JOSE"/"JOSÉ"/"JOSÈ"/... always return 83 rows! So it really becomes accent and case insensitive, but it doesn't returns the real accented rows! – Lightworker May 31 '12 at 14:04
  • OK. Next step in troubleshooting: Is there any chance your data is encoded somehow else than UTF8? try doing `SELECT NAME, LENGTH(NAME) from p where p.name like '%jos%'` and see if "José" contains the number of characters you expect. Also try a different MySQL client program. Especially if you're using a web client program like phpmyadmin, it's possible your database has HTML entities like ampersand poundsign x e 9 semicolon in that NAME column. That will totally baffle accent insensitivity. – O. Jones May 31 '12 at 18:59
  • Thanks! You saved me. – JackDavis Sep 22 '15 at 18:31
  • For future searchers, you can apply `CONVERT` to the value (`'%jose%'` in this case) instead of the column (`p.name`), and then you will get to use your indexes. – Charles Wood Sep 27 '21 at 15:28
8

Just in case someone else stumbles upon this issue, I have found a way that solves the problem, at least for me.

I am using PHP to insert and retrieve records from the database. Even though my Database, tables and columns are utf8, as well as the encoding of the PHP files, the truth is that the encoding used in the connection between PHP and MySQL is being made using latin1. I managed to find this using

$mysqli->character_set_name();

where $mysqli is your object.

For the searches to start working as expected, returning accent insensitive and case insentive records for characters with accents or not, I have to explicitly set the character set of the connection.

To do this, you just have to do the following:

$mysqli->set_charset('utf8');

where $mysqli is your mysqli object. If you have a database management class that wraps your database functionality, this is easy to apply to a complete app. If not, you have to set this explicitly everywhere you open a connection.

I hope this helps someone out, as I was already freaking out about this!

Miguel Mesquita Alfaiate
  • 2,851
  • 5
  • 30
  • 56