Accent insensitive search on a problematic database

Question

I have a database that contains data in different languages. Some languages use accents (like áéíóú) and I need to search in this data as the accents doesn't exist (search for 'campeon' should return 'campeón' as a valir result).

The problem is that the tables in my database (utf8_unicode_ci) are not storing utf8 characteres. If you see the data through phpmyadmin the words with accents looks like this: campeÃ³n

After some researching, I've found (in a StackOverflow question) that the problem is related to the inexistence of a SET NAMES [charset]. In fact, I've made some testings and if I set names to utf8, everything works as expected.

Well, I have the solution, what's the problem? The problem is that the database is in production, so there are thousands of strings in the database. If I change the character set the client will use, all already existing string will become invalid. The question is: is there any way to:

perform accent-insensitive searches in a database that uses a wrong charset like mine?
transform safely the data in the tables to the appropriate charset?
continue working with mixed charsets (latin1 and utf8) in the database, assuming that latin1 data will not be accent-insensitive?

If anybody has experience in any of the solutions I propose or has a new one, I'll be very thankful if share.

Are you confident that phpmyadmin isn't butchering the data? I personally wouldn't trust it far... — sarnold, Mar 14 '12 at 09:52
I also tested on the command line (mysql command) and is the same result — Ivan, Mar 14 '12 at 10:10

score 1 · Accepted Answer · answered Mar 14 '12 at 09:58

1

The problem being that the data was inserted using the wrong connection encoding, you can fix it by

Exporting the data using the wrong connection encoding, just like you have used it thus far, followed by
Importing the data using the correct utf8 connection encoding.

That will fix the encoding problem, after which search will work as expected.

answered Mar 14 '12 at 09:58

deceze

510,633
85
743
889

Ummhh, it seems easy. When you say 'exporting' and 'importing' are you talking about the export and import commands on the phpmyadmin or about a custom php snippet? – Ivan Mar 14 '12 at 10:12
phpmyadmin is probably the worst choice, since it handles encodings automatically, which is what you don't want in this case. A small PHP script should do it, otherwise mysqldump or similar manual queries. – deceze Mar 14 '12 at 10:55
I'm looking at mysqldump, but cannot found any way to force the importing with the encoding I need, how to do it? – Ivan Mar 14 '12 at 11:05
I think `--default-character-set` might do it, but I'm not 100% sure. `mysqldump` can be problematic in that it usually replaces the whole database when reimporting. Maybe a PHP script that puts all the necessary data into an array, then runs `UPDATE` queries for the existing data may be better. That depends on your particular situation though. – deceze Mar 14 '12 at 11:32

score 0 · Answer 2 · answered Mar 14 '12 at 09:58

0

What if you create a copy of the table at the beginning of your session, alter the copy's charset, perform all your queries from that, and then drop the table at the end of your session? I don't know how practical this would be - depends on how often you need to perform these queries and how big the table is.

answered Mar 14 '12 at 09:58

JTeagle

2,196
14
15

I'm afraid the queries are essential and this will reduce drastically the performance – Ivan Mar 14 '12 at 10:13
Understood - sorry I couldn't help. – JTeagle Mar 14 '12 at 10:17

Accent insensitive search on a problematic database

2 Answers2