3

How do I make a diacritic insensitive,

ex this persian string with diacritics

هواى بَر آفتابِ بارِز

is not the same as with removed diacritics in mySql

هواى بر آفتاب بارز

Is there a way of telling mysql to ignore the diacritics or do I have to remove all the diacritics in my fields manually?

OMG Ponies
  • 325,700
  • 82
  • 523
  • 502
maccen
  • 155
  • 1
  • 3
  • 9

5 Answers5

3

It's a bit like case-insensitivity problem.

SELECT * FROM blah WHERE UPPER(foo) = "THOMAS"

Just convert both strings to diacritic-free before comparing.

Artelius
  • 48,337
  • 13
  • 89
  • 105
  • (Not sure how to remove diacritics in SQL, perhaps someone else can help?) – Artelius Oct 18 '09 at 10:41
  • Actually it's a little more complex than case-sensitivity or insensitivity. Unless a software includes support for specific scripts (which MySQL doesn't seem to have for Persian) you're out of luck. Or you're in for some work writing a custom function to do it for you; won't be pretty either way. – Joey Oct 18 '09 at 10:57
  • It works in MySQL and Hebrew: (SELECT * FROM `dictionary` WHERE upper(nekudot)="כדי";) finds כְּדֵי. Thanks! – xerostomus Dec 09 '21 at 07:16
2

I'm using utf8 (utf8_general_ci) and searching arabic without diacritics doesn't work, it isn't insensitive or it is but don't work properly.

I tried looking at the character with and without a diacritic using Hex and it looks like mysql considering it as two distinct character.

I'm thinking about using hex and replace (a lot of replace) to search for words while filtering diacritics.

my solution to have an insensitive search for arabic words:

SELECT arabic_word FROM Word
WHERE
REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(HEX(REPLACE(
arabic_word, "-", "")), "D98E", ""), "D98B", ""), "D98F", ""), "D98C", 
""),"D991",""),"D992",""),"D990",""),"D98D","") LIKE ?', '%'.$search.'%'

the values formatted in hexadecimal are diacritics that we want to filter. ugly but I didn't find another anwser.

vsen
  • 21
  • 3
0

The cleanest solution I've come to is:

SELECT arabic_word 
FROM Word
WHERE ( arabic_word REGEXP '{$search}' OR SOUNDEX( arabic_word ) = SOUNDEX( '{$search}' ) );

I haven't checked the costs of the SOUNDEX function. I guess this could for small tables, but not for large datasets.

j0k
  • 22,600
  • 28
  • 79
  • 90
Marcos
  • 111
  • 1
  • 2
  • 1
    This will do better: SET NAMES utf8; SELECT arabic_word FROM Word WHERE arabic_word LIKE '{$search}'; I'm used to use REGEXP, but in this case LIKE is the key. Must read: http://stackoverflow.com/questions/3304464/mysql-diacritic-insensitive-search-spanish-accents – Marcos Aug 22 '12 at 14:52
0

Did you already read all of MySQL Character Set Support to check if the answer to your question isn't already there? Especially collations are to be understood.

I wild guess is that using utf8_general_ci could do the right thing for you

jitter
  • 53,475
  • 11
  • 111
  • 124
  • http://dev.mysql.com/doc/refman/5.0/en/charset-unicode-sets.html has a comment at the very bottom which addresses a similar issue for Hebrew, noting that MySQL has problems regardless which collation is used, though different problems. The issue for Persian might be similar. – Joey Oct 18 '09 at 10:52
0

Setting

set names 'utf8'

before making a query usually does the trick for latin lookups. I'm not sure if this works for arabic as well.

khel
  • 2,358
  • 2
  • 19
  • 8