6

The Problem: I am trying to implement a search algorithm that shows the results even when dotted chars are provided. In other words: SELECT 'über' = 'uber' or SELECT 'mas' = 'maş' these results will return true. This would apply for every single char in the following array:

$arr = array('ş' => 's', 'ç' => 'c', 'ö' => 'o', 'ü' => 'u' and so on ...);

The Solution In My Mind: Along with the original column, I can have a particular column that stores the English names. So before storing 'über' to database, I will also convert it to 'uber' in php and then will store both 'über' (as the original) and 'uber' (as the searchable) to the database.

But then, even though I've searched for this the whole day, I still believe that there should be a simplier and cleaner way to accomplish the task since this would mean (more or less) to store the same data twice in the database. So guys, what do you think is the solution the only way to go or you know a better approach?

EDIT

For accent insensitive I've seen the posts on SO, they are working but since I am also considering the dotted chars, I had to ask this question.

EDIT2

I cannot post the whole table structure and code exactly for some reasons but I'll provide a close example.

myusers | CREATE TABLE `myusers` (
id int auto_increment not null primary key,
email varchar(100) COLLATE latin1_general_ci not null,
fullname varchar(75) COLLATE latin1_general_ci not null)
PRIMARY KEY('id')
) ENGINE=MyISAM AUTO_INCREMENET=2 DEFAULT CHARSET=latin1 COLLATE latin1_general_ci |

The above is the structure of the table. Here comes the inserts and selects:

INSERT INTO myusers (fullname) VALUES ('Agüeda');
INSERT INTO myusers (fullname) VALUES ('Agueda');

SELECT * FROM myusers WHERE fullname = 'Agüeda' COLLATE latin1_general_ci 

+----+-------+----------+
| id | email | fullname |
+----+-------+----------+
|  1 |       | Agüeda   |
+----+-------+----------+
1 row in set (0.00 sec)

SELECT * FROM myusers WHERE fullname = 'agueda' COLLATE latin1_general_ci 

+----+-------+----------+
| id | email | fullname |
+----+-------+----------+
|  2 |       | Agueda   |
+----+-------+----------+
1 row in set (0.00 sec)

Well, the desired result is obviously when agueda is searched both 'Agueda' and 'Agüeda' will return, but that's not the case. As I mentioned above, I have created a new column and store the whole name in English characters and make the search from there as well. But still, it costs me a two times search (because I am also searching from the original columns which rank higher in the search result). There should be a better way...

Shaokan
  • 7,438
  • 15
  • 56
  • 80

4 Answers4

2

Just use an appropriate collation. For instance:

create table test(
    foo text
) collate = utf8_unicode_ci;
insert into test values('Agüeda');
insert into test values('Agueda');
select * from test where foo = 'Agueda';

This gives your two rows.

Artefacto
  • 96,375
  • 17
  • 202
  • 225
  • nope that does not work either. I've also tried with `foo = _utf8 'su' collate utf8_unicode_ci` but no result – Shaokan Oct 10 '11 at 10:32
  • Yes I've just changed it once again just to be sure – Shaokan Oct 10 '11 at 10:36
  • @Sha And you're not using the command line on Windows, I hope. – Artefacto Oct 10 '11 at 10:38
  • @Sha Try this: http://pastebin.com/ADS0SdTP using something like MySQL Workbench. In the command line, the correct codepage must be set, and even then... – Artefacto Oct 10 '11 at 10:39
  • nope I'm using ubuntu 10.04 :) and that code does not work either. Just to make sure, I'm using default mysql settings except that I've changed the charset to utf8 – Shaokan Oct 10 '11 at 10:41
  • @Shaokan: Try changing the table's charset to `CHARSET = 'utf8'` and the collation to `'utf8_unicode_ci'`. – ypercubeᵀᴹ Oct 11 '11 at 18:00
2

1) Write your own collation. latin1_general_diacriticinsensitive. I wouldn't even know where to begin, though :).

2) Use regex and character groups: /[uü]ber/

3) The Solution In Your Mind. I'd personally use this, since design is all about compromise and this is a simple solution with just a 100% space overhead. Granted, the space overhead might eventually turn into a speed overhead, especially with MySQL, but that's to worry about later. This is also very easy to undo if need be.

aib
  • 45,516
  • 10
  • 73
  • 79
0

Well, instead of trying to replace them and run the search the x-times, I'd suggest using the mysql function LIKE i.e.

SELECT * FROM x WHERE search LIKE '%ber'

Where you have to replace the diacritics with"%.

EDIT: My mistake % replaces any number of characters. Use _ for a single char.

Muhammad Waheed
  • 1,048
  • 1
  • 13
  • 30
redshark1802
  • 894
  • 2
  • 12
  • 22
0

Take a look at this post: https://stackoverflow.com/questions/500826

He has just the opposite issue you're facing. Look at the WHERE clause in the selected answer. Probably you could just use the _ci suffix and it'll work.

Let us know how this is resolved.

Community
  • 1
  • 1
Mosty Mostacho
  • 42,742
  • 16
  • 96
  • 123
  • I've already tried that and it doesn't work for me. When I do `select _utf8 'ş' collate utf8_general_ci = _utf8 's' ` it works but when I apply this to a column, it doesnt work. So I guess I'm following the solution in my mind which I depicted above – Shaokan Oct 10 '11 at 09:59