0

I have made a dictionary with about 100k words of Punjabi Language in Unicode. There is a letter , whose code in unicode is ਸ਼ and there are many such letters like . But in this language, the dot u see under the letters can also be typed separately, but there are combined letters in unicode. in the db, there are words in word table and the md5 of the word in word_hash. When i try to search the database with php with the statement SELECT * FROM db WHERE word_hash = md5('word');, it results in no records found with words with such letters with the dot. When i tried to search, i found that the md5 of the words in the db and the md5 generated by search syntax is different. Why is it so? I have entered all the words through a textbox and the md5 entered is with mysql syntax.

For ex : the code for the word ਸ਼ਰਬਤ is 45f756f02a28b5ec48ddf369db6ad7e6 echoed by mysql query and in the db is d6da1a44526c5ab1259dcc05404b1e8c

Two alternates for are ਸ਼ and ਸ਼

shahbaz
  • 372
  • 1
  • 3
  • 18
  • have you look at http://stackoverflow.com/questions/2446778/how-to-display-unicode-data-with-php?rq=1 – nKandel Jul 21 '13 at 08:53
  • 3
    Why do you search with the hash instead of the word? You should store the words in a [normalized form](http://unicode.org/reports/tr15/). Then you can search with the word instead of the hash. – Gumbo Jul 21 '13 at 08:54
  • i need to search with md5 because when i do it in a normalized form, it considers the letter with and without the dot same..so if there is a word without the dot, it shows the results even if we type the dot... – shahbaz Jul 21 '13 at 09:01

1 Answers1

1

What you have here are the different Unicode normalization forms. There are combined characters, where a base character is combined with a diacritic or other character to form an alternate version, but sometimes this alternative version may also exist as a standalone character. E.g.:

ਸ਼ GURMUKHI LETTER SHA (U+0A36)
ਸ GURMUKHI LETTER SA  (U+0A38)
 ਼ GURMUKHI SIGN NUKTA (U+0A3C)
ਸ +  ਼ (U+0A38 + U+0A3C) equivalent to ਸ਼ U+0A36

(I'm not actually sure if the GURMUKHI SIGN NUKTA is the correct combining dot here, since I don't know Gurmukhi, but you get the idea.)

For storage and comparison, you should decide on one form or the other, since it's often impossible to predict which format the input will be in. You do this using the Unicode Normalization process, which converts between both forms. In PHP you do this with the Normalizer class.

i need to search with md5 because when i do it in a normalized form, it considers the letter with and without the dot same..

You second problem is that you're inventing an overcomplicated solution to a simple problem: collations. The database uses collation rules for "fuzzy" matching, i.e. to treat "matinee" and "matineé" the same, or in your case "ਸ਼" and "ਸ". You set the default collation on the column, but you can influence it during query time as well:

SELECT ... WHERE foo = 'bar' COLLATE utf8_bin;

If you want absolute matches, use the utf8_bin collation or another equivalent _bin (binary) collation for your chosen encoding.

deceze
  • 510,633
  • 85
  • 743
  • 889
  • 1
    I'll add that even when normalizing to NFC, `U+0A36` will become `U+0A38 + U+0A3C`, because the character is listed as a [composition exclusion](http://www.unicode.org/reports/tr15/#Primary_Exclusion_List_Table). That *might* explain the "considers the letter with and without the dot same" comment, though I'm not sure. – PleaseStand Jul 21 '13 at 09:34
  • thxx fr ur answer...but when i change the encoding to utf8_bin...it doesnt even display a single record... – shahbaz Jul 21 '13 at 10:07
  • 1
    @sha *Collation!* The encoding is `utf8`, but what you're worried about is the *collation* `utf8_bin`. Anyway, if it doesn't match that's because the binary (`_bin`) value is different. Have you normalized both the search value and the value in the database to the same normalization form as discussed above? – deceze Jul 21 '13 at 10:19
  • i m not able to use the `normalizer` class....i've enabled the `php_intl` extension in wampserver....but `normalizer_normalise()` is still not working... – shahbaz Jul 22 '13 at 09:37