
I am using Hibernate Search with Spring Boot. I have a requirement that the user can apply the following search operators to the establishment name:

  1. Starts with a word

.Ali --> means the phrase must strictly start with Ali, so AlAli should not be returned in the results

query = queryBuilder.keyword().wildcard().onField("establishmentNameEn")
                        .matching(term + "*").createQuery();

It returns mixed results, with the term in the middle, at the start, or at the end, which does not meet the requirement above.

  2. Ends with a word

Kamran. --> means the phrase must strictly end with Kamran, so Kamranullah should not be returned in the results

query = queryBuilder.keyword().wildcard().onField("establishmentNameEn")
                        .matching("*"+term).createQuery();

According to the documentation, it's not a good idea to put “*” at the start. My question is: how can I achieve the expected result?

My domain class and analyzer:

@AnalyzerDef(name = "english", tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class), filters = {
        @TokenFilterDef(factory = StandardFilterFactory.class),
        @TokenFilterDef(factory = LowerCaseFilterFactory.class) })
@Indexed
@Entity
@Table(name = "DIRECTORY")
public class DirectoryEntity {

    @Analyzer(definition = "english")
    @Field(store = Store.YES)
    @Column(name = "ESTABLISHMENT_NAME_EN")
    private String establishmentNameEn;

    // getters and setters
}
Kamran Ullah
  • A leading wildcard is a bad idea in any datastore because no index can be used. However, if that is your requirement, you have to do it anyway. – Simon Martinelli Dec 30 '19 at 09:54
  • @SimonMartinelli I am not able to achieve this: .Ali --> means the phrase must strictly start with Ali, so AlAli should not be returned in the results. I even tried queryBuilder.phrase().onField("establishmentNameEn").sentence("*Ali Hassan").createQuery(); but it returns results in which Ali is in the middle or at the end. I want results that start with Ali. – Kamran Ullah Dec 30 '19 at 10:50

1 Answer


Two problems here:

Tokenizing

You're using a tokenizer, which means your searches work on individual words, not on the full string you indexed. This explains why you're getting matches on terms in the middle of the sentence.

This can be solved by creating a separate field for these special begin/end queries, and using an analyzer with the KeywordTokenizer (which is a no-op: it emits the whole input as a single token).

For example:

@AnalyzerDef(name = "english", tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class), filters = {
        @TokenFilterDef(factory = StandardFilterFactory.class),
        @TokenFilterDef(factory = LowerCaseFilterFactory.class) })
// Same filters, but the KeywordTokenizer keeps the whole name as one token
@AnalyzerDef(name = "english_beginEnd", tokenizer = @TokenizerDef(factory = KeywordTokenizerFactory.class), filters = {
        @TokenFilterDef(factory = StandardFilterFactory.class),
        @TokenFilterDef(factory = LowerCaseFilterFactory.class) })
@Indexed
@Entity
@Table(name = "DIRECTORY")
public class DirectoryEntity {

    @Analyzer(definition = "english")
    @Field(store = Store.YES)
    // Additional field dedicated to the "starts with" / "ends with" queries
    @Field(name = "establishmentNameEn_beginEnd", store = Store.YES, analyzer = @Analyzer(definition = "english_beginEnd"))
    @Column(name = "ESTABLISHMENT_NAME_EN")
    private String establishmentNameEn;

    // getters and setters
}
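To make the effect of the tokenizer concrete, here is a minimal plain-Java sketch with no Lucene involved. `wordTokens` and `keywordToken` are hypothetical stand-ins that only approximate what a word tokenizer and the KeywordTokenizer produce once lowercasing is applied; the class name and the sample establishment name are made up for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

final class TokenizerSketch {

    // Rough approximation of a word tokenizer plus lowercasing:
    // split on anything that is not a letter or digit.
    static List<String> wordTokens(String text) {
        return Arrays.stream(text.split("[^\\p{L}\\p{N}]+"))
                .filter(s -> !s.isEmpty())
                .map(s -> s.toLowerCase(Locale.ROOT))
                .collect(Collectors.toList());
    }

    // Rough approximation of a keyword tokenizer plus lowercasing:
    // the whole input becomes a single token.
    static List<String> keywordToken(String text) {
        return Collections.singletonList(text.toLowerCase(Locale.ROOT));
    }

    // A "term*" wildcard matches a document as soon as ANY of its tokens starts with the term.
    static boolean prefixMatches(List<String> tokens, String lowercasedTerm) {
        return tokens.stream().anyMatch(t -> t.startsWith(lowercasedTerm));
    }

    public static void main(String[] args) {
        String name = "Hassan Ali Trading"; // hypothetical establishment name
        System.out.println(prefixMatches(wordTokens(name), "ali"));      // true: matches the middle word
        System.out.println(prefixMatches(keywordToken(name), "ali"));    // false: the name starts with "hassan"
        System.out.println(prefixMatches(keywordToken(name), "hassan")); // true
    }
}
```

With word tokens, "ali*" matches because one of the words starts with "ali"; with a single token for the whole name, only a real prefix of the name matches.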

Query analysis and performance

The wildcard query does not trigger analysis of the entered text, which causes unexpected behavior. For example, if you index "Ali" and then search for "ali", you will probably get a result, but if you search for "Ali" you won't: the text was analyzed and indexed as "ali", which doesn't exactly match "Ali".
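Since the wildcard query skips analysis, one possible workaround is to apply the same normalization by hand before adding the wildcard. This is only a minimal sketch, assuming the field's analyzer does nothing but lowercase the text; `WildcardTerms` and its methods are hypothetical helpers, not part of Hibernate Search.

```java
import java.util.Locale;

final class WildcardTerms {

    // Lowercase the user input manually, because the wildcard query will not analyze it.
    // Assumption: lowercasing is the only normalization the analyzer performs.
    static String startsWith(String userInput) {
        return userInput.toLowerCase(Locale.ROOT) + "*";
    }

    // Same normalization, with a leading wildcard (slow on large indexes).
    static String endsWith(String userInput) {
        return "*" + userInput.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(startsWith("Ali"));  // ali*
        System.out.println(endsWith("Kamran")); // *kamran
    }
}
```

The returned string would then be passed to `.matching(...)` in the wildcard queries from the question, targeting the `establishmentNameEn_beginEnd` field.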

Additionally, as you are aware, a leading wildcard is very bad performance-wise.

If your field has a reasonable length (say, less than 30 characters), I would recommend using the "edge-ngram" analyzer instead; you will find an explanation here: Hibernate Search: How to use wildcards correctly?

Note that you will still need to use the KeywordTokenizer (unlike the example I linked).

This will take care of the "match the beginning of the text" query, but not the "match the end of the text" query.
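The idea behind the edge-ngram approach can be sketched in plain Java, without Lucene: at indexing time every prefix of the lowercased name is stored as a term, and at query time the lowercased input is looked up as an exact term, so no wildcard is needed. `EdgeNGramSketch`, the gram sizes (1 to 30) and the sample names are assumptions for illustration; in a real setup the analyzer would produce these terms.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

final class EdgeNGramSketch {

    // Index time: keep the whole name as one lowercased token,
    // then emit each prefix whose length is between minGram and maxGram.
    static Set<String> indexTerms(String name, int minGram, int maxGram) {
        String token = name.toLowerCase(Locale.ROOT);
        Set<String> grams = new LinkedHashSet<>();
        int longest = Math.min(maxGram, token.length());
        for (int len = minGram; len <= longest; len++) {
            grams.add(token.substring(0, len));
        }
        return grams;
    }

    // Query time: lowercase only (no ngrams), then do an exact term lookup.
    static boolean startsWith(Set<String> indexedTerms, String userInput) {
        return indexedTerms.contains(userInput.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        Set<String> aliHassan = indexTerms("Ali Hassan Trading", 1, 30); // hypothetical names
        Set<String> alAli = indexTerms("AlAli Stores", 1, 30);
        System.out.println(startsWith(aliHassan, "Ali")); // true
        System.out.println(startsWith(alAli, "Ali"));     // false
    }
}
```

The lookup is an exact match on a precomputed term, which is why this scales better than a wildcard; the cost is a larger index, hence the advice to limit it to reasonably short fields.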

To address that second query, I would create a separate field and a separate analyzer, similar to the one used for the first query, the only difference being that you insert a ReverseStringFilterFactory before the EdgeNGramFilterFactory. This will reverse the text before indexing ngrams, which should lead to the desired behavior. Do not forget to also use a separate query analyzer for this field, one that reverses the string.
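The reversed variant can be sketched the same way in plain Java. Reversing the lowercased name before taking prefixes turns every suffix of the name into an indexed term, and the query side reverses the input without generating ngrams. `ReversedEdgeNGramSketch`, the gram sizes and the sample names are assumptions for illustration only.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

final class ReversedEdgeNGramSketch {

    // Shared normalization: lowercase, then reverse the whole string.
    static String normalize(String text) {
        return new StringBuilder(text.toLowerCase(Locale.ROOT)).reverse().toString();
    }

    // Index time: prefixes of the reversed name, i.e. reversed suffixes of the original name.
    static Set<String> indexTerms(String name, int minGram, int maxGram) {
        String token = normalize(name);
        Set<String> grams = new LinkedHashSet<>();
        int longest = Math.min(maxGram, token.length());
        for (int len = minGram; len <= longest; len++) {
            grams.add(token.substring(0, len));
        }
        return grams;
    }

    // Query time: lowercase and reverse the input (no ngrams), then do an exact term lookup.
    static boolean endsWith(Set<String> indexedTerms, String userInput) {
        return indexedTerms.contains(normalize(userInput));
    }

    public static void main(String[] args) {
        Set<String> kamran = indexTerms("Al Noor Kamran", 1, 30); // hypothetical names
        Set<String> kamranullah = indexTerms("Saif Kamranullah", 1, 30);
        System.out.println(endsWith(kamran, "Kamran"));      // true
        System.out.println(endsWith(kamranullah, "Kamran")); // false
    }
}
```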

yrodiere