3

I am creating a search utility that will search elements in a SQLITE DB, Only problem is, that the DB contains some characters like Å è ô which are in Latin...

Is there an easy way to ignore these letters and treat them as there English alphabet counterparts (Å = A, è = e ...) ?

I thought of designing 1 to 1 mappings of all such characters something like,

    HashMap<Character, Character> lstOfChar = new HashMap<Character, Character>();
    lstOfChar.put('Å', 'A');
    lstOfChar.put('è', 'e');

And when retrieving data from database each of such character will be replaced by there English alphabet equivalent and search result will be displayed.

If I am searching Deepak then the rows containing DeepÅk or DÈepak or deepÃk should be searched

But it will be long process and maintenance will be hard too.

Is there some elegant way, may be SQLITE provide some functionality or is it possible through SQL.
I am using Java platform.

EDIT I found the Normalizer in posted answer do help to do the thing programmatic after fetching the results but can this be done by the database or through firing SQL in some special way, as it takes lot of time to fetch the results and apply this function and return result.

David R Tribble
  • 11,918
  • 5
  • 42
  • 52
Deepak Bhatia
  • 6,230
  • 2
  • 24
  • 58
  • Why do you want to get rid of non-ASCII characters? If you repace them, you might also change the meaning of the words. – isnot2bad Nov 16 '13 at 13:54
  • I am not changing the actual value in database rather fetching the results and then showing them on UI, I just want to search – Deepak Bhatia Nov 16 '13 at 13:56
  • Maybe a Collator can help: http://docs.oracle.com/javase/7/docs/api/java/text/Collator.html – isnot2bad Nov 16 '13 at 14:04

1 Answers1

5

You're looking for Normalizer. It allows you to map between accented Unicode characters and their decompositions:

Normalizer.normalize(text, Normalizer.Form.NFD).replaceAll(
    "\\p{InCombiningDiacriticalMarks}+", "");

I'd recommend you do this beforehand, since it is expensive, and chuck it in a "name_search" column or something similar. It's worth noting that this will not capture all non-"English" letters, because they don't always have decompositions, but the basics like é and  will be matched.

Regarding your edit: as I said, the best way to do this is to apply the transformation once, to all rows, and put the result in an additional database column called "name_search" or something similar. It's similar to creating an index on your table. There is no equivalent to the Normalizer's functionality in SQL, and while some RDBMS system might offer one, SQLite certainly does not.

asthasr
  • 9,125
  • 1
  • 29
  • 43
  • It helped but I am curious to know Will this change all the Unicode characters http://en.wikipedia.org/wiki/List_of_Unicode_characters to English equivalent – Deepak Bhatia Nov 16 '13 at 14:32
  • 1
    No. Only the Unicode characters that *have* a decomposition will be decomposed. If the character does not consist of a base letter plus diacritics, there is no way to convert it to an "English" equivalent because there *isn't* an English equivalent. In fact, there are thousands of characters that consist of a different alphabet's character plus that language's combining diacritics. – asthasr Nov 16 '13 at 14:36
  • 1
    it means while inserting each entry I should add one more searchable value associated with it, Firstly, I am not allowed to change the database. Secondly, I have big database with millions of records already. Thirdly, I do not want to add a new column. Any other alternative????? – Deepak Bhatia Nov 16 '13 at 14:53
  • The only option you have while using SQLite is to pre-calculate the searches or to live with it being very slow. (But SQLite with a big database scares me in any case.) – asthasr Nov 16 '13 at 15:06
  • It means, map the normalization across all your data and store the result. – asthasr Nov 16 '13 at 15:07
  • I hope you enjoy confusing fathers and potatoes in Spanish. – bmargulies Nov 16 '13 at 15:09
  • @bmargulies sorry I could not understand the meaning of it – Deepak Bhatia Nov 16 '13 at 15:14
  • "papa" is one way of saying potato in Spanish @dbw; not to be confused with "papá", meaning father. By removing accents and diacritics you'll have some words that mean wildly differing thing suddenly becoming the same word. – Ben Nov 17 '13 at 17:25