Whitespace tokenizer not working when using simple query string

Question

I first implemented the search using a simple query string, as shown below.

Entity Definition

@Entity
@Indexed
@AnalyzerDef(name = "whitespace", tokenizer = @TokenizerDef(factory = WhitespaceTokenizerFactory.class),
    filters = {
        @TokenFilterDef(factory = LowerCaseFilterFactory.class),
        @TokenFilterDef(factory = ASCIIFoldingFilterFactory.class)
    })
public class AdAccount implements SearchableEntity, Serializable {

    @Id
    @DocumentId
    @Column(name = "ID")
    @GeneratedValue(strategy = GenerationType.AUTO)
    private Long id;

    @Field(store = Store.YES, analyzer = @Analyzer(definition = "whitespace"))
    @Column(name = "NAME")
    private String name;

    //other properties and getters/setters
}

I use the whitespace tokenizer factory here because the default standard analyzer ignores special characters, which is not ideal for my use case. The document I referred to is https://lucene.apache.org/solr/guide/6_6/tokenizers.html#Tokenizers-WhiteSpaceTokenizer. It describes the whitespace tokenizer as a "Simple tokenizer that splits the text stream on whitespace and returns sequences of non-whitespace characters as tokens."

SimpleQueryString Method

protected Query inputFilterBuilder() {
    SimpleQueryStringMatchingContext simpleQueryStringMatchingContext = queryBuilder.simpleQueryString().onField("name");

    return simpleQueryStringMatchingContext
        .withAndAsDefaultOperator()
        .matching(searchRequest.getQuery() + "*").createQuery();
}

searchRequest.getQuery() returns the search query string; I then append the prefix operator (*) at the end so that it supports prefix queries.

However, this does not work as expected. Say I have an entity whose name is "AT&T Account": searching with "AT&" does not return this entity.

I then made the following change to use a whitespace analyzer directly. This time searching with "AT&" works as expected, but the search is now case-sensitive, i.e., searching with "at&" returns nothing.

@Field
@Analyzer(impl = WhitespaceAnalyzer.class)
@Column(name = "NAME")
private String name;

My questions are:

Why doesn't it work when I use the white space factory in my first attempt? I assume using the factory versus using the actual analyzer implementation is different?
How to make my search case-insensitive if I use the @Analyzer annotation as in my second attempt?

Does it work when searching for `AT*` (i.e. without the `&`)? — Guillaume Smet, Apr 12 '19 at 10:03

score 0 · Accepted Answer · answered Apr 15 '19 at 09:34

Why doesn't it work when I use the white space factory in my first attempt? I assume using the factory versus using the actual analyzer implementation is different?

Wildcard and prefix queries (the kind you get when you add a * suffix to your query string) never apply analysis. This means your lowercase filter is not applied to your search query, but it was applied to your indexed text, so the two can never match: AT&* does not match the indexed at&t.

Using the @Analyzer annotation only worked because you removed the lowercasing at index time. With this analyzer, you ended up with AT&T (uppercase) in the index, and AT&* does match the indexed AT&T. It's just by chance, though: if you index At&t, you will end up with At&t in the index and you'll end up with the same problem.
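
To make the mismatch concrete, here is a minimal sketch in plain Java that only imitates the behaviour described above; it does not use Lucene or Hibernate Search. The class and method names (PrefixMismatchSketch, analyze, prefixMatches) are hypothetical, and ASCII folding is left out for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class PrefixMismatchSketch {

    // Imitates index-time analysis: split on whitespace, optionally lowercase each token.
    static List<String> analyze(String text, boolean lowercase) {
        List<String> terms = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                terms.add(lowercase ? token.toLowerCase(Locale.ROOT) : token);
            }
        }
        return terms;
    }

    // Imitates a prefix query: the raw prefix is compared to the indexed terms
    // as-is, because no analysis is applied to it.
    static boolean prefixMatches(List<String> indexedTerms, String rawPrefix) {
        for (String term : indexedTerms) {
            if (term.startsWith(rawPrefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // First attempt: whitespace tokenizer + lowercase filter -> [at&t, account]
        List<String> lowercased = analyze("AT&T Account", true);
        // Second attempt: plain whitespace analyzer -> [AT&T, Account]
        List<String> caseKept = analyze("AT&T Account", false);

        System.out.println(prefixMatches(lowercased, "AT&")); // false
        System.out.println(prefixMatches(lowercased, "at&")); // true
        System.out.println(prefixMatches(caseKept, "AT&"));   // true
        System.out.println(prefixMatches(caseKept, "at&"));   // false
    }
}
```

In both setups only one casing of the query matches, which is why the second attempt looked like a fix for "AT&" while breaking "at&".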

How to make my search case-insensitive if I use the @Analyzer annotation as in my second attempt?

As I mentioned above, the @Analyzer annotation is not the solution; you actually made your search worse.

There is no built-in solution to make wildcard and prefix queries apply analysis, mainly because analysis could remove pattern characters such as ? or *, and that would not end well.

You could restore your initial analyzer and lowercase the query yourself, but that will only get you so far: ASCII folding and other analysis features won't work.
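
A rough sketch of that workaround, assuming the initial analyzer (whitespace tokenizer plus lowercase filter) is back in place. The helper class QueryNormalizerSketch and its method toPrefixQuery are hypothetical names, and null input is not handled.

```java
import java.util.Locale;

public class QueryNormalizerSketch {

    // Lowercases the raw user input by hand, since the prefix query will not
    // be analyzed, then appends the prefix operator.
    static String toPrefixQuery(String userInput) {
        return userInput.trim().toLowerCase(Locale.ROOT) + "*";
    }

    public static void main(String[] args) {
        System.out.println(toPrefixQuery("AT&")); // at&*
    }
}
```

In the question's inputFilterBuilder(), this would presumably be used as .matching(QueryNormalizerSketch.toPrefixQuery(searchRequest.getQuery())). Note the limit mentioned above: an accented input stays accented after lowercasing, while the ASCII folding filter has already removed the accents from the indexed terms, so such a query would still not match.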

The solution I generally recommend is to use an edge-ngrams filter. The idea is to index every prefix of every word, so "AT&T Account" would get indexed as the terms a, at, at&, at&t, a, ac, acc, acco, accou, accoun, account and a search for "at&" would return the correct results even without a wildcard.
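
The idea can be illustrated with a small, self-contained sketch in plain Java. It only imitates what an edge-ngram filter would produce at index time and how an analyzed, wildcard-free query would then match; it is not the Lucene implementation, and the names EdgeNGramSketch, edgeNGrams and matches are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class EdgeNGramSketch {

    // Index-time side: split on whitespace, lowercase, then emit every prefix of every token.
    static List<String> edgeNGrams(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            String lower = token.toLowerCase(Locale.ROOT);
            for (int end = 1; end <= lower.length(); end++) {
                terms.add(lower.substring(0, end));
            }
        }
        return terms;
    }

    // Query-time side: split on whitespace and lowercase, but do NOT generate ngrams.
    // Every query token must be present among the indexed terms (AND semantics).
    static boolean matches(List<String> indexedTerms, String query) {
        for (String token : query.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            if (!indexedTerms.contains(token.toLowerCase(Locale.ROOT))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> indexed = edgeNGrams("AT&T Account");
        // [a, at, at&, at&t, a, ac, acc, acco, accou, accoun, account]
        System.out.println(indexed);
        System.out.println(matches(indexed, "at&"));     // true
        System.out.println(matches(indexed, "AT& acc")); // true
        System.out.println(matches(indexed, "t&t"));     // false
    }
}
```

Because the query goes through normal analysis here (everything except the ngram step), lowercasing and any other filters apply to it, which is what the wildcard approach cannot offer.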

See this answer for a more extensive explanation.

If you use the Elasticsearch integration, you will have to rely on a hack to make the "query-only" analyzer work properly. See here.

Thank you for the detailed explanation! — Lyn, Apr 16 '19 at 17:30
