Ignore special characters in Examine

Question

In Umbraco, I use Examine to search the website, but the content is in French. Everything works fine except that searching for "Français" does not give the same results as searching for "Francais". Is there a way to ignore those French characters? I tried to find a FrenchAnalyzer for Lucene/Examine but did not find anything. I use fuzzy matching, so it returns results even when the words are not identical.

Here is my search code:

public static ISearchResults Search(string searchTerm)
{
    var provider = ExamineManager.Instance.SearchProviderCollection["ExternalSearcher"];
    var criteria = provider.CreateSearchCriteria(BooleanOperation.Or);

    var crawl = criteria.GroupedOr(BoostedSearchableFields, searchTerm.Boost(15))
        .Or().GroupedOr(BoostedSearchableFields, searchTerm.Fuzzy(Fuzziness))
        .Or().GroupedOr(SearchableFields, searchTerm.Fuzzy(Fuzziness))
        .Not().Field("umbracoNavHide", "1");

    return provider.Search(crawl.Compile());
}

I know this is not very helpful, but if there is a way, you could convert every special character to its plain equivalent in the content you are searching. — provençal le breton, Jan 23 '14 at 15:45
Why can't you replace the characters? I really don't see any other way, assuming you have already checked all the method overloads. — Jevgeni Geurtsen, May 22 '14 at 17:44
The problem is not when I search for "Français"; it is when I search for "Francais" without the special character. I don't get any results. It looks like the index is built with the special characters, but it should return the result even if I search for the word without them. — VinnyG, May 22 '14 at 17:46

score 1 · Accepted Answer · answered May 22 '14 at 19:58

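We ended up using a custom analyzer based on the SnowballAnalyzer. It passes the French Snowball token stream through an ISOLatin1AccentFilter, so accented characters such as "ç" are reduced to their unaccented equivalents. For "Francais" to match, the same analyzer has to be used both when indexing and when searching, and the index has to be rebuilt after the change.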

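using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Snowball;

public class CustomAnalyzer : SnowballAnalyzer
{
    // Use the French Snowball stemmer.
    public CustomAnalyzer() : base("French") { }

    public override TokenStream TokenStream(string fieldName, TextReader reader)
    {
        // Tokenize, lowercase and stem as SnowballAnalyzer normally does.
        TokenStream result = base.TokenStream(fieldName, reader);

        // Replace accented ISO Latin 1 characters with unaccented ones (e.g. "ç" becomes "c").
        result = new ISOLatin1AccentFilter(result);

        return result;
    }
}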

score 0 · Answer 2 · answered May 26 '14 at 05:26

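Try using Regex like this:

// Requires: using System.Text.RegularExpressions;
var strInput = "Français";
var strToReplace = string.Empty;
// Note: "ç" is removed, not converted to "c", so sNewString is "Franais".
var sNewString = Regex.Replace(strInput, "[^A-Za-z0-9]", strToReplace);

The pattern "[^A-Za-z0-9]" replaces every character that is not an ASCII letter or digit with an empty string.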

Hope it helps.

Israel Ocbina

Thanks Israel, but the problem is the other way around: Lucene.Net indexes all the content with the "ç", and when I search for "c" I want the results to include those with the "ç". – VinnyG May 26 '14 at 14:42

score 0 · Answer 3 · answered Dec 13 '22 at 17:46

You can convert Unicode characters with diacritics to their unaccented base letters using the following method. That lets the search term "Francais" match "Français", provided the indexed text is normalized the same way.

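using System.Globalization;
using System.Linq;
using System.Text;

public static class StringExtensions // extension methods need a static class; the name is arbitrary
{
    public static string RemoveDiacritics(this string text)
    {
        if (string.IsNullOrWhiteSpace(text))
            return text;

        // Decompose each accented character into a base letter plus combining marks.
        text = text.Normalize(NormalizationForm.FormD);
        // Drop the combining marks and keep the base letters.
        var chars = text.Where(c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark).ToArray();

        return new string(chars).Normalize(NormalizationForm.FormC);
    }
}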

Use it on any string like this:

var converted = unicodeString.RemoveDiacritics();
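
Stripping diacritics from the search term alone is not enough if the indexed terms still contain them; both sides need the same folding. Here is a minimal sketch of that idea, independent of Examine and Lucene.Net, reusing the normalization approach from this answer. The AccentFolding class and its Fold and Matches methods are hypothetical names chosen for illustration, and a real index would apply the folding inside its analyzer rather than at comparison time.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text;

// Hypothetical helper for illustration; not part of Examine or Lucene.Net.
public static class AccentFolding
{
    // Decompose, drop combining marks, recompose (same approach as RemoveDiacritics).
    public static string Fold(string text)
    {
        if (string.IsNullOrWhiteSpace(text))
            return text;

        var chars = text.Normalize(NormalizationForm.FormD)
            .Where(c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            .ToArray();

        return new string(chars).Normalize(NormalizationForm.FormC);
    }

    // Compares an indexed word and a query word after folding BOTH sides.
    public static bool Matches(string indexedWord, string queryWord)
    {
        return string.Equals(Fold(indexedWord), Fold(queryWord), StringComparison.OrdinalIgnoreCase);
    }
}
```

With this sketch, AccentFolding.Matches("Français", "Francais") is true, because both words fold to "Francais" before they are compared.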
