ignore diacritics when searching

Question

I'm using Compass/Lucene to index and search my database. I want diacritics and character case to be ignored when I search, such that a query for "foo" would match both "Fóo" and "foo", and a query for "fóó" would match "fóo" and "fOO".

Based on what I've read, it seems that I need to change the default analyzer that Compass uses both when indexing and when searching my context. I've found out where to specify the analyzer, but I can't find an analyzer implementation that meets my requirements. Is there already an analyzer that ignores diacritics and character case, or do I need to write my own?

Recurse · Answer 1 · 2012-06-27T01:16:11.043

6

Take a look at org.apache.lucene.analysis.ASCIIFoldingFilter to see if it does what you want. If not, I would use its source as a starting point for writing your own.

You are right that you must use the same Analyzer configuration for indexing and querying, for the obvious reason that if you have stripped all the diacritics from the index, you need to strip them from any query also.

One thing to be aware of: make sure you normalize any Unicode text somewhere in the indexing/querying process. For specifics see: http://unicode.org/reports/tr15/, http://unicode.org/faq/normalization.html, and http://docs.oracle.com/javase/6/docs/api/java/text/Normalizer.html.
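To illustrate why normalization matters, here is a minimal sketch using only java.text.Normalizer from the JDK. The class name NormalizationDemo and the helper nfc are made up for this example. It shows that the same visible text "Fóo" can be stored as two different code point sequences, which only compare equal after both are normalized to the same form.

```java
import java.text.Normalizer;

// Hypothetical demo class, not part of Lucene or Compass.
class NormalizationDemo {

    // Returns the NFC (canonically composed) form of the input.
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // Precomposed: 'F', U+00F3 (o with acute), 'o'
        String precomposed = "F\u00F3o";
        // Decomposed: 'F', 'o', U+0301 (combining acute accent), 'o'
        String decomposed = "Fo\u0301o";

        // The raw strings differ even though they render the same.
        System.out.println(precomposed.equals(decomposed));           // false
        // After normalizing both to NFC they are identical.
        System.out.println(nfc(precomposed).equals(nfc(decomposed))); // true
    }
}
```

If the indexed text and the query can arrive in different forms, normalizing both to one form before the analyzer sees them avoids this kind of mismatch.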

EDIT: As mentioned in the comments below, ASCIIFoldingFilter is a Filter, so you can't use it as an Analyzer directly. However, there are straightforward instructions for including it in an Analyzer here: stackoverflow.com/a/3834244/390153

EDIT: As mentioned by @jspboix in the comment below, you will also need to chain LowerCaseFilter to handle the character case.
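As a rough illustration of what the combined folding should achieve, here is a JDK-only sketch. It is not ASCIIFoldingFilter's implementation and does not use Lucene at all: it decomposes the text with java.text.Normalizer, drops combining marks, and lowercases the result. The class name FoldingDemo and the method fold are hypothetical. The point is that the same function is applied to both the indexed text and the query, so the examples from the question end up with the same key.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical demo class; a stand-in for the analyzer chain, not Lucene code.
class FoldingDemo {

    // Strips diacritics and case so that "F\u00F3o" and "foo" fold to the same key.
    static String fold(String s) {
        // NFD splits accented letters into base letter + combining mark(s).
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        // \p{M} matches Unicode mark characters, including combining accents.
        String stripped = decomposed.replaceAll("\\p{M}+", "");
        // Locale.ROOT avoids locale-specific casing rules.
        return stripped.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Query "foo" against indexed "Fóo" and "foo"
        System.out.println(fold("foo").equals(fold("F\u00F3o")));
        System.out.println(fold("foo").equals(fold("foo")));
        // Query "fóó" against indexed "fóo" and "fOO"
        System.out.println(fold("f\u00F3\u00F3").equals(fold("f\u00F3o")));
        System.out.println(fold("f\u00F3\u00F3").equals(fold("fOO")));
    }
}
```

Note that this approach only handles characters that have a canonical decomposition; a letter such as "ø" does not decompose and would pass through unchanged, which is one reason a dedicated folding filter may still be the better fit inside an analyzer.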

edited Jun 27 '12 at 01:16

answered Jun 24 '12 at 23:55

Recurse


I don't think org.apache.lucene.analysis.ASCIIFoldingFilter will solve my problem directly, because it's not an Analyzer, but might be a useful starting point – Dónal Jun 25 '12 at 00:03
No, not directly, as it is a Filter. However, you will find a short example of how to incorporate it into an Analyzer here: http://stackoverflow.com/a/3834244/390153 – Recurse Jun 25 '12 at 00:20
You will also need LowerCaseFilter for matching "fOO" with "foo". – jspboix Jun 26 '12 at 21:50
@jspboix Thank you, I forgot about that part of the question when I focused on the diacritics. – Recurse Jun 27 '12 at 01:13

score 0 · Answer 2 · answered Jun 28 '12 at 17:48

In my Grails application, I use the searchable plugin and just configured the system to use the "german" analyzer:

compassSettings = ['compass.engine.analyzer.default.type': 'German']

This ignores at least the case and umlauts - "ä" is stored as "a" in the index.

I just added "Fóo" and "Föo" to one of my test documents and searched for "foo": it finds "Föo" but not "Fóo". So I guess that if you switch the language to the right value (French?), it should work.
