How to handle synonyms and stop words when building a fuzzy query with Hibernate Search Query DSL

Question

Using Hibernate Search (5.8.2.Final) Query DSL to Elasticsearch server.

Given a field analyzer that does lowercase, standard stop-words, then a custom synonym with:

company => co

and finally, a custom stop-word:

co

And we've indexed a vendor name: Great Spaulding Company, which boils down to 2 terms in Elasticsearch after synonyms and stop-words: great and spaulding.

I'm trying to build my query so that each term 'must' match, fuzzy or exact, depending on the term length.

I get the results I want except when one of the terms happens to be a synonym or stop-word and is long enough that my code adds fuzziness to it, like company~1. In that case it is no longer seen as a synonym or stop-word and my query returns no match, since 'company' was never stored in the first place: it becomes 'co' and is then removed as a stop-word.

Time for some code. It may seem a bit hacky, but I've tried numerous ways and using simpleQueryString with withAndAsDefaultOperator and building my own phrase seems to get me the closest to the results I need (but I'm open to suggestions). I'm doing something like:

// assume passed in search String of "Great Spaulding Company"
String vendorName = "Great Spaulding Company";  
List<String> vendorNameTerms = Arrays.asList(vendorName.split(" "));
List<String> qualifiedTerms = Lists.newArrayList();

vendorNameTerms.forEach(term -> {
    int editDistance = getEditDistance(term); // 1..5 = 0, 6..10 = 1, > 10 = 2 
    int prefixLength = getPrefixLength(term); //appears of no use with simpleQueryString

    String fuzzyMarker = editDistance > 0 ? "~" + editDistance : "";
    qualifiedTerms.add(String.format("%s%s", term, fuzzyMarker));
});

// join my terms back together with their optional fuzziness marker
String phrase = qualifiedTerms.stream().collect(Collectors.joining(" "));

bool.should(
        qb.simpleQueryString()
                .onField("vendorNames.vendorName")
                .withAndAsDefaultOperator()
                .matching(phrase)
                .createQuery()
);
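For reference, the phrase-building step above can be reproduced without Hibernate Search. This is only a minimal sketch: getEditDistance is not shown in the question, so the version below is an assumption reconstructed from the inline comment (1..5 = 0, 6..10 = 1, > 10 = 2), and the class name FuzzyPhraseSketch is made up for illustration.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

final class FuzzyPhraseSketch {

    // Assumed rule, taken from the comment in the question:
    // 1..5 characters = 0 edits, 6..10 = 1 edit, more than 10 = 2 edits.
    static int getEditDistance(String term) {
        int length = term.length();
        if (length <= 5) {
            return 0;
        }
        return length <= 10 ? 1 : 2;
    }

    // Splits on whitespace and appends "~N" to every term that earns a
    // non-zero edit distance. No analysis happens here, which is why
    // "Company" is treated like any other 7-letter word.
    static String buildPhrase(String input) {
        return Arrays.stream(input.trim().split("\\s+"))
                .filter(term -> !term.isEmpty())
                .map(term -> {
                    int editDistance = getEditDistance(term);
                    return editDistance > 0 ? term + "~" + editDistance : term;
                })
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        System.out.println(buildPhrase("Great Spaulding Company"));
    }
}
```

The sketch only looks at term length, so it has no way of knowing that a term would be rewritten or dropped by the analyzer.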

As I said above, I'm finding that as long as I don't add any fuzziness to a possible synonym or stop-word, the query finds a match. So these phrases return a match: "Great Spaulding~1" or "Great Spaulding~1 Co" or "Spaulding Co"

But my code doesn't know which terms are synonyms or stop-words. It blindly looks at term length and says, oh, 'Company' is greater than 5 characters, I'll make it fuzzy. So it builds these sorts of phrases, which are NOT returning a match: "Great Spaulding~1 Company~1" or "Great Company~1"

Why is Elasticsearch not processing Company~1 as a synonym?
Any idea on how I can make this work with simpleQueryString or another DSL query?
How is everyone handling fuzzy searching on text that may contain stopwords?

[Edit] Same issue happens with punctuation that my analyzer would normally remove. I cannot include any punctuation in the fuzzy search string in my query b/c the ES analyzer doesn't seem to treat it as it would non-fuzzy and I don't get a match result.

Example based on above search string: Great Spaulding Company., gets built in my code to the phrase Great Spaulding~1 Company.,~1 and ES doesn't remove the punctuation or recognize the synonym word Company

I'm going to try a hack of calling the ES _analyze REST api so that it tells me which tokens I should include in the query, although this will add overhead to every query I build. For example, something like http://localhost:9200/myEntity/_analyze?analyzer=vendorNameAnalyzer&text=Great Spaulding Company., which produces 3 tokens: great, spaulding and company.

score 0 · Answer 1 · answered Jul 30 '18 at 08:14

Why is Elasticsearch not processing Company~1 as a synonym?

I'm going to guess it's because fuzzy queries are "term-level" queries, which means they operate on exact terms instead of analyzed text. If your term, once analyzed, resolved to multiple tokens, I don't think it would be easy to define an acceptable behavior for a fuzzy query.

There's a more detailed explanation there (I believe it still applies to the Lucene version used in Elasticsearch 5.6).

Any idea on how I can make this work with simpleQueryString or another DSL query? How is everyone handling fuzzy searching on text that may contain stopwords?

You could try reversing your synonym: use co => company instead of company => co, so that a query such as compayn~1 will match even if "compayn" is not analyzed. But that's not a satisfying solution, of course, since other examples requiring analysis still won't work, such as Company~1.

Below are alternative solutions.

Solution 1: "match" query with fuzziness

This article describes a way to perform fuzzy searches, and in particular explains the difference between several types of fuzzy queries.

Unfortunately, it seems that fuzzy queries in "simple query string" queries are translated into the type of query that does not perform analysis.

However, depending on your requirements, the "match" query may be enough. In order to access all the settings provided by Elasticsearch, you will have to fall back to native query building:

    QueryDescriptor query = ElasticsearchQueries.fromJson(
            "{ 'query': {"
                + "'match' : {"
                    + "'vendorNames.vendorName': {"
                        // Note that using a proper JSON framework would be better here, to avoid problems with quotes in the terms
                        + "'query': '" + userProvidedTerms + "',"
                        + "'operator': 'and',"
                        + "'fuzziness': 'AUTO'"
                    + "}"
                + "}"
            + " } }"
    );
    List<?> result = session.createFullTextQuery( query ).list();
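To illustrate the quoting problem mentioned in the comment above, here is a minimal sketch that escapes the user-provided terms before concatenating them into the JSON. The class and method names (MatchQueryJsonSketch, escapeJson, matchQueryJson) are made up for this example, and a real JSON library would still be the safer choice. It emits standard double-quoted JSON rather than the single-quoted form used above.

```java
final class MatchQueryJsonSketch {

    // Minimal JSON string escaping: quotes, backslashes and control characters.
    static String escapeJson(String value) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters as four-digit hex escapes
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    // Builds the same "match" query as above: operator "and", fuzziness "AUTO".
    static String matchQueryJson(String field, String userProvidedTerms) {
        return "{\"query\":{\"match\":{\"" + escapeJson(field) + "\":{"
                + "\"query\":\"" + escapeJson(userProvidedTerms) + "\","
                + "\"operator\":\"and\","
                + "\"fuzziness\":\"AUTO\""
                + "}}}}";
    }

    public static void main(String[] args) {
        System.out.println(matchQueryJson("vendorNames.vendorName", "Great \"Spaulding\" Company"));
    }
}
```

The resulting string would then be passed to ElasticsearchQueries.fromJson( ... ) in place of the hand-concatenated one.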

See this page for details about what "AUTO" means in the above example.
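As a rough illustration, and as far as I understand the Elasticsearch documentation, the default "AUTO" setting picks the allowed edit distance from the term length: terms of 0 to 2 characters must match exactly, 3 to 5 characters allow one edit, and longer terms allow two. The helper below is a hypothetical sketch of that rule, not code taken from Elasticsearch.

```java
final class AutoFuzzinessSketch {

    // Sketch of the documented default "AUTO" thresholds:
    // 0..2 characters = 0 edits, 3..5 = 1 edit, more than 5 = 2 edits.
    static int maxEditsForAuto(String term) {
        int length = term.length();
        if (length <= 2) {
            return 0;
        }
        if (length <= 5) {
            return 1;
        }
        return 2;
    }

    public static void main(String[] args) {
        for (String term : new String[] { "co", "great", "company" }) {
            System.out.println(term + " -> " + maxEditsForAuto(term));
        }
    }
}
```

This is more permissive than the length rule in the question (1..5 = 0, 6..10 = 1, > 10 = 2), so results may differ from what the existing code produces.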

Note that until Hibernate Search 6 is released, you can't mix native queries like the one shown above with the Hibernate Search DSL. Either you use the DSL, or native queries, but not both in the same query.

Solution 2: ngrams

In my opinion, your best bet when the queries originate from your users, and those users are not Lucene experts, is to avoid parsing the queries altogether. Query parsing involves (at least in part) text analysis, and text analysis is best left to Lucene/Elasticsearch.

Then all you can do is configure the analyzers.

One way to add "fuzziness" with these tools would be to use an NGram filter. With min_gram = 3 and max_gram = 3, for example:

An indexed string such as "company" would be indexed as ["com", "omp", "mpa", "pan", "any"]
A query such as "compayn", once analyzed, would be translated to (essentially) com OR omp OR mpa OR pay OR ayn
Such a query would potentially match a lot of documents, but when sorting by score, the document for "Great Spaulding Company" would come up to the top, because it matches almost all of the ngrams.

I used parameter values min_gram = 3 and max_gram = 3 for the example, but in a real world application something like min_gram = 3 and max_gram = 5 would work better, since the added, longer ngrams would give a better score to search terms that match a longer part of the indexed terms.
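The ngram overlap described above can be checked with a few lines of plain Java. This is only a sketch of the idea: the real tokens are produced by the Lucene/Elasticsearch NGram filter, whose output order and scoring are not reproduced here, and the class and method names (NGramSketch, ngrams, shared) are made up for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class NGramSketch {

    // Produces every substring of length minGram..maxGram, scanning left to right.
    static List<String> ngrams(String term, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (int start = 0; start < term.length(); start++) {
            for (int size = minGram; size <= maxGram && start + size <= term.length(); size++) {
                grams.add(term.substring(start, start + size));
            }
        }
        return grams;
    }

    // Ngrams of the query that also exist among the indexed ngrams.
    static Set<String> shared(List<String> indexed, List<String> queried) {
        Set<String> common = new LinkedHashSet<>(queried);
        common.retainAll(indexed);
        return common;
    }

    public static void main(String[] args) {
        List<String> indexed = ngrams("company", 3, 3);
        List<String> queried = ngrams("compayn", 3, 3);
        System.out.println(indexed);                  // [com, omp, mpa, pan, any]
        System.out.println(queried);                  // [com, omp, mpa, pay, ayn]
        System.out.println(shared(indexed, queried)); // [com, omp, mpa]
    }
}
```

With 3-grams only, the misspelled "compayn" shares three of its five ngrams with "company". Widening the range to 3..5 adds longer shared ngrams such as "comp" and "compa", which is what gives closer matches a better score.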

Of course, if you can't sort by score, or if you can't accept too many trailing partial matches in the results, then this solution won't work for you.

Thanks Yoann, you've given a lot of good information here. In our situation, we're migrating existing code with very particular exact and probable matching rules. We have 4500 unit tests and a good number of these test 'Exact' and 'Probable' matching rules on names, addresses, phoneNumbers etc. in various combinations. The way I solved it (without changing the existing rules we have) was to upgrade to Hibernate ORM 5.3.3 and Hibernate-Search 5.10 so I could use the Elasticsearch RestClient to issue an _analyze call directly to ES and then use the returned tokens to build my queries with. — Rick Gagne, Aug 02 '18 at 11:52
I know this probably isn't ideal b/c in the big picture, ES ends up analyzing 1 additional time per Query that I build, but it allows me to know, before I build the Query, whether an input string might analyze down to nothing or not. For instance, a vendorName of 'Company' gets completely removed after analyzing, so I don't even want to include it in my Query. It also lets ES determine the tokens instead of me trying to parse the string and manually add fuzziness to each one. I still add fuzziness but can now use the proper HibernateSearch .keyword().fuzzy() syntax. — Rick Gagne, Aug 02 '18 at 12:01
