How to support tokenized and untokenized search at the same time

Question

I am trying to make Hibernate Search support both tokenized and untokenized search (pardon me if I use the wrong terms here). An example follows.

I have a list of entities of the following type.

@Entity
@Indexed
@NormalizerDef(name = "lowercase",
    filters = {
        @TokenFilterDef(factory = ASCIIFoldingFilterFactory.class),
        @TokenFilterDef(factory = LowerCaseFilterFactory.class)
    }
)
public class Deal {
    //other fields omitted for brevity purposes

    @Field(store = Store.YES)
    @Field(name = "name_Sort", store = Store.YES, normalizer= @Normalizer(definition="lowercase"))
    @SortableField(forField = "name_Sort")
    @Column(name = "NAME")
    private String name = "New Deal";

    //Getters/Setters omitted here
}

I also used the keyword method to build the query, as shown below. The getSearchableFields method returns a list of searchable fields. In this example, "name" is in the returned list because the name field in Deal is searchable.

    protected Query inputFilterBuilder() {
        return queryBuilder.keyword()
            .wildcard().onFields(getSearchableFields())
            .matching("*" + searchRequest.getQuery().toLowerCase() + "*").createQuery();
    }

This setup works fine when I search using an entire word. For example, suppose I have two Deal entities, one named "Practical Concrete Hat" and the other named "Practical Cotton Cheese". When searching by "Practical", I get both entities back. But when searching by "Practical Co", I get no entities back. The reason is that the name field is tokenized and "Practical Co" is not a keyword.

My question is how to support both kinds of search at the same time, so that these two entities are returned when searching by either "Practical" or "Practical Co".

I read through the official Hibernate Search documentation, and my hunch is that I should add one more field for untokenized search. Perhaps the way I construct the query builder needs to be updated as well?

Update

Non-working solution using SimpleQueryString.

Based on the provided answer, I've written the following query builder logic. However, it doesn't work.

    protected Query inputFilterBuilder() {
        String[] searchableFields = getSearchableFields();
        if(searchableFields.length == 0) {
            return queryBuilder.simpleQueryString().onField("").matching("").createQuery();
        }
        SimpleQueryStringMatchingContext simpleQueryStringMatchingContext = queryBuilder.simpleQueryString().onField(searchableFields[0]);
        for(int i = 1; i < searchableFields.length; i++) {
            simpleQueryStringMatchingContext = simpleQueryStringMatchingContext.andField(searchableFields[i]);
        }
        return simpleQueryStringMatchingContext
            .matching("\"" + searchRequest.getQuery() + "\"").createQuery();
    }

Working solution using a separate query analyzer and phrase queries.

I found in the official documentation that phrase queries can be used to search for more than one word. So I wrote the following query builder method.

    protected Query inputFilterBuilder() {
        String[] searchableFields = getSearchableFields();
        if(searchableFields.length == 0) {
            return queryBuilder.phrase().onField("").sentence("").createQuery();
        }
        PhraseMatchingContext phraseMatchingContext = queryBuilder.phrase().onField(searchableFields[0]);
        for(int i = 1; i < searchableFields.length; i++) {
            phraseMatchingContext = phraseMatchingContext.andField(searchableFields[i]);
        }
        return phraseMatchingContext.sentence(searchRequest.getQuery()).createQuery();
    }

On its own, this does not work when the search input contains more than one word separated by a space. Once I added separate analyzers for indexing and querying, as suggested, it suddenly worked.

Analyzer definitions:

@AnalyzerDef(name = "edgeNgram", tokenizer = @TokenizerDef(factory = WhitespaceTokenizerFactory.class),
    filters = {
        @TokenFilterDef(factory = ASCIIFoldingFilterFactory.class),
        @TokenFilterDef(factory = LowerCaseFilterFactory.class),
        @TokenFilterDef(factory = EdgeNGramFilterFactory.class,
                        params = {
                            @Parameter(name = "minGramSize", value = "1"),
                            @Parameter(name = "maxGramSize", value = "10")
                        })
    })
@AnalyzerDef(name = "edgeNGram_query", tokenizer = @TokenizerDef(factory = WhitespaceTokenizerFactory.class),
    filters = {
        @TokenFilterDef(factory = ASCIIFoldingFilterFactory.class),
        @TokenFilterDef(factory = LowerCaseFilterFactory.class)
    })

Annotation for Deal name field:

    @Field(store = Store.YES, analyzer = @Analyzer(definition = "edgeNgram"))
    @Field(name = "edgeNGram_query", store = Store.YES, analyzer = @Analyzer(definition = "edgeNGram_query"))
    @Field(name = "name_Sort", store = Store.YES, normalizer= @Normalizer(definition="lowercase"))
    @SortableField(forField = "name_Sort")
    @Column(name = "NAME")
    private String name = "New Deal";

Code that overrides the analyzer of the name field (and of any other searchable field) to use the query analyzer:

            String[] searchableFields = getSearchableFields();
            if(searchableFields.length > 0) {
                EntityContext entityContext = fullTextEntityManager.getSearchFactory()
                    .buildQueryBuilder().forEntity(this.getClass().getAnnotation(SearchType.class).clazz()).overridesForField(searchableFields[0], "edgeNGram_query");

                for(int i = 1; i < searchableFields.length; i++) {
                    entityContext.overridesForField(searchableFields[i], "edgeNGram_query");
                }
                queryBuilder = entityContext.get();
            }

Follow-up question: why does the above tweak actually work?

score 1 · Accepted Answer · answered Mar 27 '19 at 07:38


Your problem here is the wildcard query. Wildcard queries do not support tokenization: they only work on single tokens. In fact, they don't even support normalization, which is why you had to lowercase the user input yourself...
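To illustrate the point about single tokens, here is a minimal plain-Java sketch. It does not use Hibernate Search or Lucene: the class `WildcardOnTokensSketch` and its methods are hypothetical stand-ins, and index-time analysis is assumed to be nothing more than a whitespace split plus lowercasing. It only models the idea that a wildcard pattern is compared against each indexed token separately, never against the whole field value.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of Hibernate Search or Lucene.
final class WildcardOnTokensSketch {

    // Assumed stand-in for index-time analysis: split on whitespace, then lowercase.
    static List<String> analyze(String text) {
        return Arrays.asList(text.trim().toLowerCase(Locale.ROOT).split("\\s+"));
    }

    // Translates '*' and '?' wildcards into an equivalent regular expression.
    static Pattern toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    // The pattern is tested against one token at a time, so a pattern
    // containing a space can never match: no single token contains a space.
    static boolean wildcardMatches(String fieldValue, String wildcard) {
        Pattern pattern = toRegex(wildcard);
        for (String token : analyze(fieldValue)) {
            if (pattern.matcher(token).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String name = "Practical Concrete Hat";
        System.out.println(wildcardMatches(name, "*practical*"));    // true
        System.out.println(wildcardMatches(name, "*practical co*")); // false
    }
}
```

In this model, "*practical co*" fails for both example entities because the indexed tokens are "practical", "concrete", "hat" and so on, and none of them spans two words.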

The solution would not be to mix tokenized and untokenized search (that's possible, but wouldn't really solve your problem). The solution would be to forget about wildcard queries altogether and use an edge n-gram filter in your analyzer.
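As a rough sketch of why edge n-grams help, the plain-Java model below indexes every prefix of every word and analyzes the user input without n-grams. It is not the Hibernate Search or Lucene API: the class `EdgeNGramSketch`, its method names, and the "every query term must be present" matching rule are assumptions made for illustration. The gram sizes 1 and 10 mirror the values used in the question.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; not part of Hibernate Search or Lucene.
final class EdgeNGramSketch {

    // Assumed query-time analysis: whitespace split plus lowercasing, no n-grams.
    static List<String> queryAnalyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Assumed index-time analysis: same tokens, each expanded into its
    // prefixes of length minGram..maxGram (the "edge n-grams").
    static Set<String> indexAnalyze(String text, int minGram, int maxGram) {
        Set<String> grams = new LinkedHashSet<>();
        for (String token : queryAnalyze(text)) {
            int longest = Math.min(maxGram, token.length());
            for (int len = minGram; len <= longest; len++) {
                grams.add(token.substring(0, len));
            }
        }
        return grams;
    }

    // Simplified matching rule (an assumption): every query term must be
    // present among the indexed grams.
    static boolean matches(String fieldValue, String userInput) {
        Set<String> indexed = indexAnalyze(fieldValue, 1, 10);
        List<String> terms = queryAnalyze(userInput);
        return !terms.isEmpty() && indexed.containsAll(terms);
    }

    public static void main(String[] args) {
        System.out.println(matches("Practical Concrete Hat", "Practical Co"));   // true
        System.out.println(matches("Practical Cotton Cheese", "Practical Co"));  // true
        System.out.println(matches("Practical Concrete Hat", "Practical Cot"));  // false
    }
}
```

Because the prefixes are produced at indexing time, the query side only needs exact term lookups, which is why the user input is analyzed without the n-gram step in this sketch. Note that in this model a query word longer than the maximum gram size (10 here) finds nothing, since no gram that long is ever indexed.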

See this answer for an extended explanation.

If you use the Elasticsearch integration, you will have to rely on a hack to make the "query-only" analyzer work properly. See here.


yrodiere


Thanks for the reply! I've read through the related links in your answer and noticed your comment about using SimpleQueryString in this link. https://stackoverflow.com/questions/43044350/hibernate-search-ngram-analyzer-with-mingramsize-1/43047342#43047342 The entity in my use case will also have multiple searchable fields, so can I use SimpleQueryString to avoid writing custom analyzer annotation at each searchable field? – Lyn Mar 27 '19 at 20:26
I rewrote the query builder logic shown in my question post. By adding extra double quotes, I use the phrase operator of SimpleQueryParser so that the query returns results containing exactly the query phrase in any searchable field. However, it does not work as expected: searching with a phrase that contains a space still returns nothing. Is it because of the way these searchable fields are indexed? – Lyn Mar 27 '19 at 21:44
The comment section has a small character count limit so I added some updates in my question post. – Lyn Mar 27 '19 at 23:23
Phrase queries look for an exact sequence of words in the document, they are probably not what you're looking for. From what I understand my suggestion works for you; if you have another problem, please ask another question, and detail what you mean by "doesn't work" (exception? no results? wrong results? if so, what was expected, what did you get?) – yrodiere Apr 01 '19 at 06:47
Ah, ok, I'll ask a new question about this, thanks! – Lyn Apr 01 '19 at 18:17
