I'm using the Hibernate Search (5.8.2.Final) Query DSL against an Elasticsearch server.
The field analyzer applies lowercase, then standard stop-words, then a custom synonym (company => co), and finally a custom stop-word (co).
We've indexed a vendor name, Great Spaulding Company, which boils down to 2 terms in Elasticsearch after synonyms and stop-words: great and spaulding.
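To make the chain concrete, here is a minimal plain-Java simulation of those four steps. It is only a sketch and not the real Lucene/Elasticsearch analyzer: the class name, the simplistic tokenizer, and the tiny "standard" stop-word set are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the vendor name analyzer described above.
class VendorNameAnalyzerSketch {

    // Small subset of English stop words, for illustration only.
    private static final Set<String> STANDARD_STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "of", "the"));

    // Custom synonym rule from the post: company => co
    private static final Map<String, String> SYNONYMS = new HashMap<>();

    // Custom stop word from the post: co
    private static final Set<String> CUSTOM_STOP_WORDS =
            new HashSet<>(Arrays.asList("co"));

    static {
        SYNONYMS.put("company", "co");
    }

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        // Crude tokenizer: split on anything that is not a letter or digit.
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            String token = raw.toLowerCase(Locale.ROOT);          // 1. lowercase
            if (token.isEmpty() || STANDARD_STOP_WORDS.contains(token)) {
                continue;                                         // 2. standard stop-words
            }
            token = SYNONYMS.getOrDefault(token, token);          // 3. synonym
            if (CUSTOM_STOP_WORDS.contains(token)) {
                continue;                                         // 4. custom stop-word
            }
            tokens.add(token);
        }
        return tokens;
    }
}
```

With this sketch, analyze("Great Spaulding Company") yields the two terms great and spaulding, which is what ends up in the index.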
I'm trying to build my query so that each term 'must' match, fuzzy or exact, depending on the term length.
I get the results I want except when one of the terms happens to be a synonym or stop-word and is long enough that my code adds fuzziness to it, like company~1. In that case it is no longer treated as a synonym or stop-word and my query returns no match, since 'company' was never stored in the first place: it becomes 'co' and is then removed as a stop-word.
Time for some code. It may seem a bit hacky, but I've tried numerous ways, and using simpleQueryString with withAndAsDefaultOperator and building my own phrase seems to get me the closest to the results I need (but I'm open to suggestions). I'm doing something like:
    // assume passed in search String of "Great Spaulding Company"
    String vendorName = "Great Spaulding Company";
    List<String> vendorNameTerms = Arrays.asList(vendorName.split(" "));
    List<String> qualifiedTerms = Lists.newArrayList();
    vendorNameTerms.forEach(term -> {
        int editDistance = getEditDistance(term); // 1..5 = 0, 6..10 = 1, > 10 = 2
        int prefixLength = getPrefixLength(term); // appears to be of no use with simpleQueryString
        String fuzzyMarker = editDistance > 0 ? "~" + editDistance : "";
        qualifiedTerms.add(String.format("%s%s", term, fuzzyMarker));
    });
    // join my terms back together with their optional fuzziness marker
    String phrase = qualifiedTerms.stream().collect(Collectors.joining(" "));
    bool.should(
        qb.simpleQueryString()
            .onField("vendorNames.vendorName")
            .withAndAsDefaultOperator()
            .matching(phrase)
            .createQuery()
    );
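For anyone who wants to reproduce the phrases without the Hibernate Search part, here is a self-contained sketch of just the phrase building. The class name is made up, and getEditDistance is my guess at the helper based only on the comment above (1..5 = 0, 6..10 = 1, > 10 = 2); the real helper is not shown here.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

// Hypothetical, stand-alone version of the phrase-building step.
class FuzzyPhraseSketch {

    // Assumed thresholds, taken from the comment in the snippet above.
    static int getEditDistance(String term) {
        int length = term.length();
        if (length <= 5) {
            return 0;
        }
        if (length <= 10) {
            return 1;
        }
        return 2;
    }

    // Appends "~N" to every term whose length calls for fuzziness.
    static String buildPhrase(String vendorName) {
        return Arrays.stream(vendorName.trim().split("\\s+"))
                .map(term -> {
                    int editDistance = getEditDistance(term);
                    return editDistance > 0 ? term + "~" + editDistance : term;
                })
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        System.out.println(buildPhrase("Great Spaulding Company"));
    }
}
```

Note that the decision is made purely on the raw term length, before any analysis happens, so this step has no way of knowing that a term would later be rewritten or dropped by the analyzer.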
As I said above, I'm finding that as long as I don't add any fuzziness to a possible synonym or stop-word, the query finds a match. So these phrases return a match:
"Great Spaulding~1"
or "Great Spaulding~1 Co"
or "Spaulding Co"
But my code doesn't know which terms are synonyms or stop-words. It blindly looks at the term length, decides that 'Company' is longer than 5 characters and makes it fuzzy, so it builds these sorts of phrases, which do NOT return a match:
"Great Spaulding~1 Company~1"
or "Great Company~1"
- Why is Elasticsearch not processing Company~1 as a synonym?
- Any idea on how I can make this work with simpleQueryString or another DSL query?
- How is everyone handling fuzzy searching on text that may contain stopwords?
[Edit] The same issue happens with punctuation that my analyzer would normally remove. I cannot include any punctuation in the fuzzy search string in my query, because the ES analyzer doesn't seem to treat a fuzzy term the way it treats a non-fuzzy one, and I don't get a match.
Example based on the above search string: Great Spaulding Company., gets built in my code to the phrase Great Spaulding~1 Company.,~1 and ES doesn't remove the punctuation or recognize the synonym word Company.
I'm going to try a hack of calling the ES _analyze REST API so that it tells me which tokens I should include in the query, although this will add overhead to every query I build. Something like http://localhost:9200/myEntity/_analyze?analyzer=vendorNameAnalyzer&text=Great Spaulding Company., which produces 3 tokens: great, spaulding and company.
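Roughly what I have in mind, as a sketch only. The class and method names are made up, the HTTP call and JSON parsing are left out, and the query-parameter form of _analyze is simply the one from the URL above, so it may not be accepted by every Elasticsearch version. The edit distance thresholds are the same assumed ones as before (1..5 = 0, 6..10 = 1, > 10 = 2).

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper: ask ES for the analyzed tokens first, then add fuzziness.
class AnalyzeFirstSketch {

    // Builds the _analyze request URL with a URL-encoded analyzer name and text.
    static String buildAnalyzeUrl(String baseUrl, String index, String analyzer, String text) {
        try {
            String utf8 = StandardCharsets.UTF_8.name();
            return baseUrl + "/" + index + "/_analyze"
                    + "?analyzer=" + URLEncoder.encode(analyzer, utf8)
                    + "&text=" + URLEncoder.encode(text, utf8);
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on the JVM, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    // Assumed thresholds: 1..5 = 0, 6..10 = 1, > 10 = 2.
    static int getEditDistance(String term) {
        int length = term.length();
        return length <= 5 ? 0 : (length <= 10 ? 1 : 2);
    }

    // Applies the fuzzy marker to the tokens returned by _analyze,
    // instead of to the raw words typed by the user.
    static String buildPhraseFromTokens(List<String> tokens) {
        return tokens.stream()
                .map(token -> {
                    int editDistance = getEditDistance(token);
                    return editDistance > 0 ? token + "~" + editDistance : token;
                })
                .collect(Collectors.joining(" "));
    }
}
```

The idea is that the phrase would be built from whatever tokens come back, so punctuation and anything else the analyzer strips would never reach the fuzzy query. The cost is one extra round trip to ES per search.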