Hibernate Search: How to use wildcards correctly?

Question

I have the following query to search patients by full name, for a specific medical center:

MustJunction mj = qb.bool().must(qb.keyword()
    .onField("medicalCenter.id")
    .matching(medicalCenter.getId())
    .createQuery());
for(String term: terms)
    if(!term.equals(""))
       mj.must(qb.keyword()
       .onField("fullName")
       .matching(term+"*")
       .createQuery());

And it is working perfectly, but only if the user types the full first name and/or last name of the patient.

However, I would like to make it work even if the user types only part of the first name or last name.

For example, if there's a patient called "Bilbo Baggins", I would like the search to find him when the user types "Bilbo Baggins", "Bilbo", "Baggins", or even just "Bil" or "Bag".

To achieve this I modified the above query as follows:

MustJunction mj = qb.bool().must(qb.keyword()
    .onField("medicalCenter.id")
    .matching(medicalCenter.getId())
    .createQuery());
for(String term: terms)
    if(!term.equals(""))
       mj.must(qb.keyword()
       .wildcard()
       .onField("fullName")
       .matching(term+"*")
       .createQuery());

Note how I added the wildcard() call before the call to onField().

However, this breaks the search and returns no results. What am I doing wrong?

Wildcard queries aren't analyzed. Simple solution would be to lowercase `term`. — femtoRgon, Oct 24 '17 at 03:24
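
The suggestion in the comment above can be sketched in plain Java. This is only an illustration, assuming the field is indexed with an analyzer that lowercases tokens; `WildcardTerms` and `toPrefixPattern` are hypothetical names and not part of Hibernate Search.

```java
import java.util.Locale;

final class WildcardTerms {
    private WildcardTerms() {}

    /**
     * Normalizes a user-typed term by hand, because wildcard queries skip analysis.
     * Hypothetical helper: it only trims and lowercases, it does not fold accents.
     */
    static String toPrefixPattern(String term) {
        // Locale.ROOT avoids locale-specific case rules (e.g. the Turkish dotless i).
        return term.trim().toLowerCase(Locale.ROOT) + "*";
    }

    public static void main(String[] args) {
        System.out.println(toPrefixPattern("Bil")); // prints bil*
    }
}
```

The result would then be passed to `.matching(...)` in the wildcard query from the question instead of `term+"*"`.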

yrodiere · Accepted Answer · 2021-08-06T07:08:11.270

Updated answer for Hibernate Search 6

Short answer: don't use wildcard queries, use a custom analyzer with an EdgeNGramFilterFactory. Also, don't try to analyze the query yourself (that's what you did by splitting the query into terms): Lucene will do it much better (with a WhitespaceTokenizerFactory, an ASCIIFoldingFilterFactory and a LowerCaseFilterFactory in particular).

Long answer:

Wildcard queries are useful as quick and easy solutions to one-time problems, but they are not very flexible and reach their limits quite quickly. In particular, as @femtoRgon mentioned, these queries are not analyzed (at least not completely, and not with every backend), so an uppercase query won't match a lowercase name, for instance.

The classic solution to most problems in the Lucene/Elasticsearch world is to use specially-crafted analyzers at index time and query time (not necessarily the same). In your case, you will want two analyzers of this kind, one for indexing and one for searching:

Lucene:

public class MyAnalysisConfigurer implements LuceneAnalysisConfigurer {
    @Override
    public void configure(LuceneAnalysisConfigurationContext context) {
        context.analyzer( "autocomplete_indexing" ).custom()
                .tokenizer( WhitespaceTokenizerFactory.class )
                // Lowercase all characters
                .tokenFilter( LowerCaseFilterFactory.class )
                // Replace accented characters by their simpler counterpart (è => e, etc.)
                .tokenFilter( ASCIIFoldingFilterFactory.class )
                // Generate prefix tokens
                .tokenFilter( EdgeNGramFilterFactory.class )
                        .param( "minGramSize", "1" )
                        .param( "maxGramSize", "10" );
        // Same as "autocomplete_indexing", but without the edge-ngram filter
        context.analyzer( "autocomplete_search" ).custom()
                .tokenizer( WhitespaceTokenizerFactory.class )
                // Lowercase all characters
                .tokenFilter( LowerCaseFilterFactory.class )
                // Replace accented characters by their simpler counterpart (è => e, etc.)
                .tokenFilter( ASCIIFoldingFilterFactory.class );
    }
}

Elasticsearch:

public class MyAnalysisConfigurer implements ElasticsearchAnalysisConfigurer {
    @Override
    public void configure(ElasticsearchAnalysisConfigurationContext context) {
        context.analyzer( "autocomplete_indexing" ).custom()
                .tokenizer( "whitespace" )
                .tokenFilters( "lowercase", "asciifolding", "autocomplete_edge_ngram" );
        context.tokenFilter( "autocomplete_edge_ngram" )
                .type( "edge_ngram" )
                .param( "min_gram", 1 )
                .param( "max_gram", 10 );
        // Same as "autocomplete_indexing", but without the edge-ngram filter
        context.analyzer( "autocomplete_search" ).custom()
                .tokenizer( "whitespace" )
                .tokenFilters( "lowercase", "asciifolding" );
    }
}

The indexing analyzer will transform "Mauricio Ubilla Carvajal" to this list of tokens:

m
ma
mau
maur
mauri
mauric
maurici
mauricio
u
ub
...
ubilla
c
ca
...
carvajal

And the query analyzer will turn the query "mau UB" into ["mau", "ub"], which will match the indexed name (both tokens are present in the index).
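
To make the token lists above concrete, here is a plain-Java sketch that imitates the two analyzers without Lucene. It is an approximation under stated assumptions: `AutocompleteSketch` and its methods are hypothetical names, whitespace splitting stands in for the whitespace tokenizer, and accent folding is done with `java.text.Normalizer`, which covers fewer characters than `ASCIIFoldingFilterFactory`.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class AutocompleteSketch {
    private AutocompleteSketch() {}

    /** Rough stand-in for the search analyzer: whitespace split, accent folding, lowercase. */
    static List<String> searchTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            if (raw.isEmpty()) {
                continue; // blank input yields a single empty string
            }
            // Decompose accented characters, then drop the combining marks (è => e).
            String folded = Normalizer.normalize(raw, Normalizer.Form.NFD)
                    .replaceAll("\\p{M}", "");
            tokens.add(folded.toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    /** Rough stand-in for the indexing analyzer: search tokens plus their edge n-grams. */
    static Set<String> indexTokens(String text, int minGram, int maxGram) {
        Set<String> grams = new LinkedHashSet<>();
        for (String token : searchTokens(text)) {
            int longest = Math.min(maxGram, token.length());
            for (int len = minGram; len <= longest; len++) {
                grams.add(token.substring(0, len)); // prefixes only
            }
        }
        return grams;
    }

    /** True when every query token is among the indexed tokens (AND semantics). */
    static boolean matchesAll(String indexedText, String query) {
        // 1 and 10 mirror minGramSize and maxGramSize from the configuration above.
        return indexTokens(indexedText, 1, 10).containsAll(searchTokens(query));
    }

    public static void main(String[] args) {
        System.out.println(indexTokens("Mauricio Ubilla Carvajal", 1, 10));
        System.out.println(matchesAll("Mauricio Ubilla Carvajal", "mau UB")); // prints true
    }
}
```

Two things are worth noticing in the sketch. With a maximum gram size of 10, a name longer than 10 characters is indexed only up to its first 10 characters, so a longer query token would not match. And `matchesAll` requires every query token to be present; whether a real query behaves that way depends on the default operator of the predicate being used.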

Note that you'll obviously have to assign the analyzers to the field. In Hibernate Search 6 it's easy, as you can assign a searchAnalyzer to a field, separately from the indexing analyzer:

@FullTextField(analyzer = "autocomplete_indexing", searchAnalyzer = "autocomplete_search")

Then you can easily search with, say, a simpleQueryString predicate:

List<Patient> hits = searchSession.search( Patient.class )
        .where( f -> f.simpleQueryString().field( "fullName" )
                .matching( "mau + UB" ) )
        .fetchHits( 20 );

Or if you don't need extra syntax and operators, a match predicate should do:

List<Patient> hits = searchSession.search( Patient.class )
        .where( f -> f.match().field( "fullName" )
                .matching( "mau UB" ) )
        .fetchHits( 20 );

Original answer for Hibernate Search 5

Short answer: don't use wildcard queries, use a custom analyzer with an EdgeNGramFilterFactory. Also, don't try to analyze the query yourself (that's what you did by splitting the query into terms): Lucene will do it much better (with a WhitespaceTokenizerFactory, an ASCIIFoldingFilterFactory and a LowerCaseFilterFactory in particular).

Long answer:

Wildcard queries are useful as quick and easy solutions to one-time problems, but they are not very flexible and reach their limits quite quickly. In particular, as @femtoRgon mentioned, these queries are not analyzed, so an uppercase query won't match a lowercase name, for instance.

The classic solution to most problems in the Lucene world is to use specially-crafted analyzers at index time and query time (not necessarily the same). In your case, you will want to use this kind of analyzer when indexing:

@AnalyzerDef(name = "edgeNgram",
    tokenizer = @TokenizerDef(factory = WhitespaceTokenizerFactory.class),
    filters = {
            @TokenFilterDef(factory = ASCIIFoldingFilterFactory.class), // Replace accented characters by their simpler counterpart (è => e, etc.)
            @TokenFilterDef(factory = LowerCaseFilterFactory.class), // Lowercase all characters
            @TokenFilterDef(
                    factory = EdgeNGramFilterFactory.class, // Generate prefix tokens
                    params = {
                            @Parameter(name = "minGramSize", value = "1"),
                            @Parameter(name = "maxGramSize", value = "10")
                    }
            )
    })

And this kind when querying:

@AnalyzerDef(name = "edgeNGram_query",
    tokenizer = @TokenizerDef(factory = WhitespaceTokenizerFactory.class),
    filters = {
            @TokenFilterDef(factory = ASCIIFoldingFilterFactory.class), // Replace accented characters by their simpler counterpart (è => e, etc.)
            @TokenFilterDef(factory = LowerCaseFilterFactory.class) // Lowercase all characters
    })

The index analyzer will transform "Mauricio Ubilla Carvajal" to this list of tokens:

m
ma
mau
maur
mauri
mauric
maurici
mauricio
u
ub
...
ubilla
c
ca
...
carvajal

And the query analyzer will turn the query "mau UB" into ["mau", "ub"], which will match the indexed name (both tokens are present in the index).

Note that you'll obviously have to assign the analyzer to the field. For the indexing part, it's done using the @Analyzer annotation. For the query part, you'll have to use overridesForField on the query builder as shown here:

QueryBuilder queryBuilder = fullTextEntityManager.getSearchFactory().buildQueryBuilder().forEntity(Hospital.class)
    .overridesForField( "name", "edgeNGram_query" )
    .get();
// Then it's business as usual

Also note that, in Hibernate Search 5, Elasticsearch analyzer definitions are only generated by Hibernate Search if they are actually assigned to an index. So the query analyzer definition will not, by default, be generated, and Elasticsearch will complain that it does not know the analyzer. Here is a workaround: https://discourse.hibernate.org/t/cannot-find-the-overridden-analyzer-when-using-overridesforfield/1043/4?u=yrodiere

The reason why I'm splitting the query into terms, is because I want all the terms to be mandatory. For example, if I search for "Mauricio Ubilla" I want results matching both terms, (all the patients with first name "Mauricio" AND last name "Ubilla" not all the patients with first name "Mauricio" OR last name "Ubilla"). If there's another way to do this, I would appreciate if you could tell me. — Mauricio Ubilla Carvajal, Oct 24 '17 at 14:33
To use the filters you mentioned, I have to import Solr jars, right? I've seen a lot of examples using this kind of solution, and I don't know why they are included as "optional" in the Hibernate Search bundle, when they seem to be pretty much the preferred way to solve this. I'm fixing some dependency problems to try your solution. Thanks a lot. — Mauricio Ubilla Carvajal, Oct 24 '17 at 14:39
Thank you very much! It worked perfectly. I only have one last question: I want to define the "edgeNGram_query" Analyzer inside my Datasource class, but for some reason, when I place the @AnalyzerDef annotation inside it, the query code doesn't find its definition (at runtime). However, if I place the definition inside my Patient class, it works fine. Do you have any idea why Hibernate Search finds the Analyzer inside my Patient class, but not inside my Datasource class? — Mauricio Ubilla Carvajal, Oct 24 '17 at 20:49
"The reason why I'm splitting the query ..." => You might also want to have a look at [simple query strings](https://docs.jboss.org/hibernate/search/5.8/reference/en-US/html_single/#_simple_query_string_queries) introduced in 5.8, in particular the `withAndAsDefaultOperator` method. It's not a perfect fit for your use case (since it allows users to use their own operators such as `*` or `+`), but it might work for you. — yrodiere, Oct 25 '17 at 08:14
"I want to define the edgeNGram_query Analyzer inside my Datasource class" => I don't know what your Datasource class is, but be aware that AnalyzerDef annotations are only inspected on indexed classes and packages of indexed classes. — yrodiere, Oct 25 '17 at 08:15
In 5.8, you can also define analyzers programmatically, which makes more sense when the definition is not related to one indexed class in particular: https://docs.jboss.org/hibernate/search/5.8/reference/en-US/html_single/#section-programmatic-analyzer-definition — yrodiere, Oct 25 '17 at 08:17
"I don't know what your Datasource class is" A class whose purpose is to perform -hibernate- connections to the DB, and that does not represent an Entity. "be aware that AnalyzerDef annotation are only inspected on indexed classes and packages of indexed classes" that's the answer to my question :) — Mauricio Ubilla Carvajal, Oct 25 '17 at 14:15
" A class whose purpose is to perform -hibernate- connections to the DB" Right, that one I know. I thought maybe it was part of your business model, since that's the only way analyzerDef could work :) — yrodiere, Oct 26 '17 at 07:07
