Hibernate Search with Autocomplete and Fuzzy-Functionality

Question

I am trying to create a Hibernate Search equivalent of the StringUtils containsIgnoreCase() method, combined with fuzzy-search matching.

Assume the user types the letter "p": they should get all matches that contain the letter "p" (regardless of whether the letter is at the beginning, middle or end of the match).

As they complete a word such as "Peter", they should also receive fuzzy matches, e.g. "Petar", "Petaer" and "Peder".

I am using the custom query and index Analyzers provided in the great answer here, because I need minGramSize at 1 to allow for the autocomplete functionality, while at the same time I also expect multi-word user input separated by white spaces such as "EUR Account of Peter", which can be in different cases (lower or upper).

So a user should be able to type "AND" and receive the above example as a match.

Currently, I am using the following query:

  org.apache.lucene.search.Query fuzzySearchByName = qb.keyword().fuzzy()
                                                   .withEditDistanceUpTo(1).onField("name")
                                                   .matching(userInput).createQuery();
  booleanQuery.add(fuzzySearchByName, BooleanClause.Occur.MUST);

However, exact matches do not receive precedence in the search results:

If we type "petar", we get the following results:

1. Petarr (non-exact match)
2. Petaer (non-exact match)
...
4. PETAR (exact match)

The same applies to the user input "peter", where the first result is "Petero" and the second is "Peter" (the second should be the first).

I also need multi-word queries to match only exactly: e.g. if I start typing "Account for...", all matched results should include the phrase "Account for", plus possibly fuzzy variants of that phrase (basically the same as the containsIgnoreCase() method mentioned earlier, with fuzzy support added).

I guess, however, that this conflicts with the minGramSize of 1 and the WhitespaceTokenizerFactory?

score 2 · Accepted Answer · edited Apr 25 '20 at 14:49

However, exact matches do not receive precedence in the search results:

Just use two queries instead of one:

EDIT: you will also need to set up two separate fields for autocomplete and "exact" match; see my edit at the bottom.

  org.apache.lucene.search.Query exactSearchByName = qb.keyword().onField("name")
                                                   .matching(userInput).createQuery();
  org.apache.lucene.search.Query fuzzySearchByName = qb.keyword().fuzzy()
                                                   .withEditDistanceUpTo(1).onField("name")
                                                   .matching(userInput).createQuery();
  org.apache.lucene.search.Query searchByName = qb.bool().should(exactSearchByName).should(fuzzySearchByName).createQuery();
  booleanQuery.add(searchByName, BooleanClause.Occur.MUST);

This will match documents that contain the user input exactly or approximately, so this will match the same documents as your example. However, documents that contain the user input exactly will match both queries, while documents that only contain something similar will only match the fuzzy query. As a result, exact matches will have a higher score and end up higher up in the result list.
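
To see why the combined query favours exact matches, here is a deliberately simplified sketch in plain Java. It is not Lucene's scoring formula (real scores also depend on term frequency, field length and the configured similarity). The class and method names are made up for illustration, and each clause is assumed to contribute a fixed amount when it matches.

```java
// Toy model only: each matching SHOULD clause adds its contribution to the
// document score. Real Lucene scoring is more involved than this.
final class ShouldScoringSketch {

    // exactBoost is the hypothetical weight of the exact clause;
    // the fuzzy clause is assumed to contribute 1.0 when it matches.
    static double score(boolean matchesExact, boolean matchesFuzzy, double exactBoost) {
        double total = 0.0;
        if (matchesExact) {
            total += exactBoost; // contribution of the exact-match clause
        }
        if (matchesFuzzy) {
            total += 1.0;        // contribution of the fuzzy clause
        }
        return total;            // 0.0 means "no clause matched"
    }

    public static void main(String[] args) {
        System.out.println("exact + fuzzy: " + score(true, true, 1.0));
        System.out.println("fuzzy only:    " + score(false, true, 1.0));
    }
}
```

In this toy model a document matching both clauses gets 2.0 while a fuzzy-only match gets 1.0, and raising the exact clause's weight widens that gap.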

If exact matches are not high enough, try adding a boost to the exactSearchByName query:

  org.apache.lucene.search.Query exactSearchByName = qb.keyword().onField("name")
                                                   .matching(userInput)
                                                   .boostedTo(4.0f)
                                                   .createQuery();

I guess, however, that this conflicts with the minGramSize of 1 and the WhitespaceTokenizerFactory?

If you want to match documents that contain any word (but not necessarily all words) appearing in the user input, and to put documents containing more words higher in the result list, do what I explained above.

If you want to match documents that contain all words in the exact same order, use a KeywordTokenizerFactory (i.e. no tokenizing).

If you want to match documents that contain all words in any order, well... that's less obvious. There's no support for that in Hibernate Search (yet), so you will essentially have to build the query yourself. One hack that I've already seen is something like this:

Analyzer analyzer = fullTextSession.getSearchFactory().getAnalyzer( "myAnalyzer" );

QueryParser queryParser = new QueryParser( "name", analyzer );
queryParser.setDefaultOperator( QueryParser.Operator.AND ); // Match *all* terms
Query luceneQuery = queryParser.parse( userInput );

... but that will not generate fuzzy queries. If you want fuzzy queries, you can try to override some methods in a custom subclass of QueryParser. I didn't try this, but it might work:

public final class FuzzyQueryParser extends QueryParser {
    private final int maxEditDistance;
    private final int prefixLength;

    public FuzzyQueryParser(String fieldName, Analyzer analyzer, int maxEditDistance, int prefixLength) {
        super( fieldName, analyzer );
        this.maxEditDistance = maxEditDistance;
        this.prefixLength = prefixLength;
    }

    @Override
    protected Query newTermQuery(Term term) {
        return new FuzzyQuery( term, maxEditDistance, prefixLength );
    }
}

EDIT: With a minGramSize of 1, you will get lots of very frequent terms: single or two-character terms extracted from the beginning of words. It is likely to cause many unwanted matches that will be scored high (because the terms are frequent) and will probably drown exact matches.
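
To get a feel for how many short terms this produces, the following plain-Java sketch approximates what a whitespace tokenizer, a lowercase filter and an edge n-gram filter emit. It does not use Lucene; `EdgeNGramSketch`, `edgeNGrams` and the maxGram of 10 are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative approximation of whitespace tokenizing + lowercasing +
// edge n-gram filtering. This is not the Lucene implementation.
final class EdgeNGramSketch {

    // Returns, for each whitespace-separated word, every lowercased prefix
    // whose length is between minGram and maxGram (inclusive).
    static List<String> edgeNGrams(String text, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (String word : text.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (word.isEmpty()) {
                continue; // empty input yields a single empty token; skip it
            }
            int longest = Math.min(maxGram, word.length());
            for (int len = minGram; len <= longest; len++) {
                grams.add(word.substring(0, len));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(edgeNGrams("EUR Account of Peter", 1, 10));
    }
}
```

For "EUR Account of Peter" this yields 17 terms with a minimum gram size of 1, including the single letters "e", "a", "o" and "p", versus 9 terms with a minimum of 3.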

First, you can try setting the similarity (~ scoring formula) to org.apache.lucene.search.similarities.BM25Similarity, which is better at ignoring very frequent terms. See here for the setting. That should improve scoring with the same analyzers.

Second, you can try setting up two fields instead of one: one field for fuzzy autocomplete and one for non-fuzzy, complete matches. That may improve the score of exact matches since there will be fewer meaningless terms indexed for the field used for exact matches. Just do this:

@Field(name = "name", analyzer = @Analyzer(definition = "text"))
@Field(name = "name_autocomplete", analyzer = @Analyzer(definition = "edgeNgram"))
private String name;

The analyzer "text" is just the analyzer "edgeNGram_query" from the answer you linked; just rename it.

Then proceed with writing two queries instead of one as explained above, but make sure to target two different fields:

  org.apache.lucene.search.Query exactSearchByName = qb.keyword().onField("name")
                                                   .matching(userInput).createQuery();
  org.apache.lucene.search.Query fuzzySearchByName = qb.keyword().fuzzy()
                                                   .withEditDistanceUpTo(1).onField("name_autocomplete")
                                                   .matching(userInput).createQuery();
  org.apache.lucene.search.Query searchByName = qb.bool().should(exactSearchByName).should(fuzzySearchByName).createQuery();
  booleanQuery.add(searchByName, BooleanClause.Occur.MUST);

Don't forget to reindex after those changes, of course.
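
Be aware that fuzzy matching against edge n-grams is very permissive, which explains the "from bo" result discussed in the comments below. The sketch here is plain Java, not Lucene: it uses textbook Levenshtein distance (Lucene's FuzzyQuery additionally counts transpositions by default, as far as I know), and the names `FuzzyPrefixSketch`, `editDistance` and `anyPrefixWithin` are hypothetical.

```java
import java.util.Locale;

// Illustration of why a fuzzy query on an edge n-gram field matches so much.
final class FuzzyPrefixSketch {

    // Classic dynamic-programming Levenshtein distance
    // (insertions, deletions, substitutions; no transpositions).
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // True if any lowercased prefix of `word` (i.e. any edge n-gram with
    // minGramSize 1) is within maxEdits of the lowercased query term.
    static boolean anyPrefixWithin(String word, String queryTerm, int maxEdits) {
        String w = word.toLowerCase(Locale.ROOT);
        String q = queryTerm.toLowerCase(Locale.ROOT);
        for (int len = 1; len <= w.length(); len++) {
            if (editDistance(w.substring(0, len), q) <= maxEdits) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println("Load vs bo: " + anyPrefixWithin("Load", "bo", 1));
        System.out.println("BBS vs bo:  " + anyPrefixWithin("BBS", "bo", 1));
    }
}
```

Under these assumptions both "Load" (via "lo") and "BBS" (via "bb") are within one edit of "bo", so a short query term combined with an edit distance of 1 will match a large share of the index.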

Thanks for your great answer yrodiere, but unfortunately exact matching does not receive any precedence, even when boosted. However, a more serious issue is that the query "from bo" returns documents with the names "BBS" and "EUR Load Testing". What could be causing this issue - is it something related to my original setup? — Petar Bivolarski, Apr 21 '20 at 06:46
"Load" starts with "Lo", which is within 1 edit distance of "Bo", so it matches. Same for "BBS" => "BB" => matches "Bo". That's fuzzy search for you... Regarding scoring, I updated my answer. — yrodiere, Apr 21 '20 at 07:38
