Using Exact Prefix/MatchPhrase Prefix Queries with Ngram Filter

Question

My goal is to support search queries that are only one or two characters long. These are my index settings:

"settings" : {
      "index" : {
        "number_of_shards" : "5",
        "provided_name" : "my_user",
        "analysis" : {
          "filter" : {
            "ngrammed" : {
              "type" : "ngram",
              "min_gram" : "3",
              "max_gram" : "50"
            }
          },
          "analyzer" : {
            "ngrammed_ci" : {
              "filter" : [
                "lowercase",
                "ngrammed"
              ],
              "type" : "custom",
              "tokenizer" : "standard"
            },
            "keyword_ci" : {
              "filter" : [
                "lowercase"
              ],
              "type" : "custom",
              "tokenizer" : "keyword"
            }
          }
        }
      }
    }

I have a set of users with a display name field that uses the analyzers shown in the mapping below. Say I have a few users with names like Allen, Alec, Kimball, and Polly. The problem is that when I search with a two-character query string such as al, it matches Kimball in addition to Allen and Alec, because the ngram filter produces the token all for Kimball in the inverted index. I am trying to avoid this.

I would also like to know whether there is any way to implement this without changing anything on the index side, making changes only on the query side.

"user_display_name" : {
  "type" : "text",
  "fields" : {
    "ci" : {
    "type" : "text",
    "analyzer" : "keyword_ci"
    },
    "cs" : {
      "type" : "keyword"
    }
  },
  "analyzer" : "ngrammed_ci",
  "search_analyzer" : "standard"
}
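
To make the problem concrete, here is a small Python sketch that approximates what the ngrammed_ci analyzer emits: lowercase the word, then produce every substring of length min_gram to max_gram. It is a simulation under that assumption, not Elasticsearch itself. The helper names ngrams and matches_prefix are made up for illustration, and single-word names are assumed so the standard tokenizer step can be skipped.

```python
def ngrams(text, min_gram=3, max_gram=50):
    """Approximate the lowercase + ngram filter chain for a single word."""
    word = text.lower()
    tokens = []
    for start in range(len(word)):
        for length in range(min_gram, max_gram + 1):
            if start + length > len(word):
                break  # no longer substrings are possible from this start
            tokens.append(word[start:start + length])
    return tokens


def matches_prefix(name, prefix):
    """True if any simulated index token starts with the prefix."""
    return any(token.startswith(prefix) for token in ngrams(name))


names = ["Allen", "Alec", "Kimball", "Polly"]
hits = [name for name in names if matches_prefix(name, "al")]
print(hits)  # Kimball is included because of its inner token "all"
```

Because the ngram filter emits substrings from every position, a prefix-style lookup for al reaches the token all that comes from the middle of Kimball.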

That is a very large index. It will be slow no matter how you query it. — Nice-Guy, Aug 07 '20 at 18:31

score 0 · Answer 1 · answered Aug 08 '20 at 01:43

In your case, you need ngrams that start at the beginning of words. When that is the case, it makes more sense to use edge ngrams instead.

Here is a working example with the index mapping, index data, search query, and search result.

Mapping (note that the filter type is now edge_ngram):

{
  "settings": {
    "analysis": {
      "filter": {
        "ngrammed": {
          "type": "edge_ngram",
          "min_gram": "2",
          "max_gram": "50"
        }
      },
      "analyzer": {
        "ngrammed_ci": {
          "filter": [
            "lowercase",
            "ngrammed"
          ],
          "type": "custom",
          "tokenizer": "standard"
        },
        "keyword_ci": {
          "filter": [
            "lowercase"
          ],
          "type": "custom",
          "tokenizer": "keyword"
        }
      }
    },
    "index.max_ngram_diff": 50
  },
  "mappings": {
    "properties": {
      "user_display_name": {
        "type": "text",
        "fields": {
          "ci": {
            "type": "text",
            "analyzer": "keyword_ci"
          },
          "cs": {
            "type": "keyword"
          }
        },
        "analyzer": "ngrammed_ci",
        "search_analyzer": "standard"
      }
    }
  }
}

The following tokens will be generated:

GET /my-index/_analyze

{
  "analyzer" : "ngrammed_ci",
  "text" : "Allen"
}

"tokens": [
    {
      "token": "al",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "all",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "alle",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "allen",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
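
The same token list can be reproduced with a short Python sketch. It approximates the lowercase + edge_ngram chain from the mapping above (min_gram 2) and then does an exact term lookup for al, which is what the standard search analyzer would produce for that query. The function name edge_ngrams and the dictionary used as a stand-in for the inverted index are assumptions for illustration only.

```python
def edge_ngrams(text, min_gram=2, max_gram=50):
    """Approximate lowercase + edge_ngram for a single word: prefixes only."""
    word = text.lower()
    longest = min(len(word), max_gram)
    return [word[:n] for n in range(min_gram, longest + 1)]


names = ["Allen", "Alec", "Kimball", "Polly"]

# Stand-in for the inverted index: name -> tokens emitted at index time.
index = {name: edge_ngrams(name) for name in names}

# The query "al" stays a single token, so this is an exact term match.
hits = [name for name, tokens in index.items() if "al" in tokens]

print(edge_ngrams("Allen"))
print(hits)
```

Since only prefixes are indexed, Kimball never produces a token beginning with al, so it drops out of the results.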

Index Data:

{ "user_display_name" : "Allen" }
{ "user_display_name" : "Alec" }
{ "user_display_name" : "Kimball" }
{ "user_display_name" : "Polly" }

Search Query:

{
  "query": {
    "query_string": {
      "query": "al",
      "default_field": "user_display_name"
    }
  }
}

Search Result:

 "hits": [
      {
        "_index": "my-index",
        "_type": "_doc",
        "_id": "1",
        "_score": 1.0087044,
        "_source": {
          "user_display_name": "Allen"
        }
      },
      {
        "_index": "my-index",
        "_type": "_doc",
        "_id": "2",
        "_score": 1.0087044,
        "_source": {
          "user_display_name": "Alec"
        }
      }
    ]

@k3np4ch1 did u get a chance to go through my answer, looking forward to get feedback from u :) — ESCoder, Aug 09 '20 at 03:44
Well, this would involve changing my index right? I am looking at doing changes only on the query side of things. No changes to the index settings cause that would involve a lot of work and there is a huge amount of data that needs to be updated — k3np4ch1, Aug 10 '20 at 17:51
@k3np4ch1 yes this would involve changing index, but **it makes more sense to use edge_ngram instead** — ESCoder, Aug 11 '20 at 06:13

score 0 · Answer 2 · answered Aug 08 '20 at 02:16

Since you mentioned that you want a solution that doesn't require changing the index, I would suggest using the prefix query. Before sending the prefix query, make sure you lowercase your search term: as I can see, you use keyword_ci, which lowercases your usernames in the index to provide case-insensitive search.

Let me show you a working example on your sample data.

I created the minimal mapping required, shown below:

{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "keyword_ci": {
            "filter": [
              "lowercase"
            ],
            "type": "custom",
            "tokenizer": "keyword"
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "user_display_name": {
        "type": "text",
        "analyzer": "keyword_ci"
      }
    }
  }
}

Index your four users (Allen, Alec, Kimball, and Polly), one document each, for example:

{
  "user_display_name" : "Polly"
}

Search query. Please note that the prefix query does not lowercase the search term, so you need to lowercase it in your application before sending the prefix query below:

{
  "query": {
    "prefix" : { "user_display_name" : "al" }
  }
}
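
The application-side lowercasing could look like the following minimal Python sketch, which only builds the request body shown above. The function name build_prefix_query is hypothetical. The field parameter defaults to user_display_name as in this minimal mapping; in the mapping from the question, the keyword_ci-analyzed data lives in the user_display_name.ci subfield, so passing that name instead is an assumption about how this would carry over to the existing index, not something tested here.

```python
import json


def build_prefix_query(term, field="user_display_name"):
    """Build a prefix query body, lowercasing the term on the client side."""
    return {"query": {"prefix": {field: term.lower()}}}


# Same body as the query above, even if the user typed upper case.
body = build_prefix_query("AL")
print(json.dumps(body, indent=2))

# Hypothetical variant targeting the subfield from the question's mapping.
body_ci = build_prefix_query("Al", field="user_display_name.ci")
print(json.dumps(body_ci, indent=2))
```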

And below are your expected results:

 "hits": [
      {
        "_index": "internaledgepre",
        "_type": "_doc",
        "_id": "1",
        "_score": 1.0,
        "_source": {
          "user_display_name": "Allen"
        }
      },
      {
        "_index": "internaledgepre",
        "_type": "_doc",
        "_id": "2",
        "_score": 1.0,
        "_source": {
          "user_display_name": "Alec"
        }
      }
    ]

I've also written a blog post on various techniques for partial search, and this SO answer of mine discusses how to choose a partial search approach based on various factors. Please go through them for a deeper understanding.

In this above example you aren't creating `ngram` tokens, and that's why the search is getting the appropriate results, so for example, my inverted index tokenizes `Kimball` into -> `kim`, `imb`, `mba`, `bal`, `all`, `kimb`, `imba`, `mbal`, `ball`, `kimba`, `imbal`, `mball`, `kimbal`, `imball`, `kimball` So it matches the token `all` with the prefix query. That's what my understanding is, correct me if I am missing something. — k3np4ch1, Aug 10 '20 at 17:49
@k3np4ch1 thanks for coming back on this, reason I am getting appropriate results is that I am not creating the ngram tokens, but I am creating the tokens based on keyword tokens which are lowercased, to provide the case insensitive search. and yes you are correct, due to ngram tokens, your prefix query matches the `kimball` for `all` prefix query which is not correct. — Amit, Aug 11 '20 at 02:26
thanks for replying, the only issue I have is that I already am creating the tokens using ngramfilter as well as the keyword filter, and I do not want to change that functionality, is there any work around without modifying our indexes? — k3np4ch1, Aug 13 '20 at 21:45
@k3np4ch1, what you are using is ngramfilter, not edge-ngram, hence you will not get better prefix results and what I gave you doesn't require much change and it will give huge benefit, hope you went through the blog posts and my other SO answer to understand trade-off better, you will have to change something and what I am suggesting would require the least changes. — Amit, Aug 14 '20 at 02:12
