
I'm building an analyzer to provide partial search on terms. I want to use a 2-5 ngram tokenizer at index time and a 5-5 ngram tokenizer at search time.

The rationale for using 2-5 ngrams at index time is that a partial search term of length 2 should still match.

At search time, if the search term is shorter than 5 characters, it can be looked up directly in the inverted index. If it is 5 characters or longer, the term is tokenized into 5-grams and the document matches if all tokens match.

However, in Elasticsearch, a 5-5 ngram tokenizer produces no tokens at all if the query term is shorter than 5 characters. One workaround would be to use the same 2-5 tokenizer at search time as at index time, but that would mean searching for all the 2-gram, 3-gram and 4-gram tokens as well, which is useless (the 5-gram tokens are sufficient).
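This behavior can be checked with the _analyze API; the inline tokenizer below is equivalent to the 5-5_ngram_token definition in the mapping that follows:

GET _analyze
{
  "tokenizer": { "type": "ngram", "min_gram": 5, "max_gram": 5 },
  "text": "emanuel"
}

This returns the tokens emanu, manue and anuel, whereas the same request with "text": "ema" returns an empty token list.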

Here is my current index mapping:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "index_partial": {
          "type": "custom",
          "tokenizer": "2-5_ngram_token"
        },
        "search_partial": {
          "type": "custom",
          "tokenizer": "5-5_ngram_token"
        }
      },
      "tokenizer": {
        "2-5_ngram_token": {
          "type": "nGram",
          "min_gram": "2",
          "max_gram": "5"
        },
        "5-5_ngram_token": {
          "type": "nGram",
          "min_gram": "5",
          "max_gram": "5"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "keyword"
      },
      "name_trans": {
        "type": "text",
        "fields": {
          "partial": {
            "type": "text",
            "analyzer": "index_partial",
            "search_analyzer": "search_partial"
          }
        }
      }
    }
  }
}

So my question is: how can I create an analyzer that is a no-op if the search term is shorter than 5 characters, and produces 5-gram tokens if it is longer?

----------------------UPDATE WITH WORK AROUND SOLUTION-----------------------

It does not seem possible to create an analyzer that is a no-op when len < 5 and produces 5-5 ngrams when len >= 5.

There are two workaround solutions to perform partial search:

1- As mentioned by @Amit Khandelwal, one solution is to use the maximum ngram length at index time. If your field has at most 30 chars, use a tokenizer with ngram 2-30 and, at search time, search for the exact term without processing it with the ngram analyzer (either via a term query or by setting the search analyzer to keyword). A sketch of this mapping follows below.

The drawback of this solution is that it can result in a huge inverted index, depending on the maximum length.
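For reference, here is a minimal sketch of what solution 1 could look like. The index, analyzer and field names are illustrative, a 30-char maximum is assumed, and index.max_ngram_diff must cover the 2-30 gap:

PUT name_test_max_ngram
{
  "settings": {
    "max_ngram_diff": 28,
    "analysis": {
      "analyzer": {
        "2-30nGrams": {
          "type": "custom",
          "tokenizer": "2-30_ngram_token",
          "filter": ["lowercase"]
        }
      },
      "tokenizer": {
        "2-30_ngram_token": {
          "type": "nGram",
          "min_gram": "2",
          "max_gram": "30"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name_trans": {
        "type": "text",
        "fields": {
          "partial": {
            "type": "text",
            "analyzer": "2-30nGrams",
            "search_analyzer": "keyword"
          }
        }
      }
    }
  }
}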

2- The other solution is to create two fields:

- one for short search terms, which can be looked up directly in the inverted index without being tokenized;
- one for longer search terms, which must be tokenized into ngrams.

Depending on the length of the search term, the search is performed on one or the other of these two fields.

Below is the mapping I used for solution 2 (the limit I chose between short and long terms is len=5):

PUT name_test
{
  "settings": {
    "max_ngram_diff": 3,
    "analysis": {
      "analyzer": {
        "2-4nGrams": {
          "type": "custom",
          "tokenizer": "2-4_ngram_token",
          "filter": ["lowercase"]
        },
        "5-5nGrams": {
          "type": "custom",
          "tokenizer": "5-5_ngram_token",
          "filter": ["lowercase"]
        }
      },
      "tokenizer": {
        "2-4_ngram_token": {
          "type": "nGram",
          "min_gram": "2",
          "max_gram": "4"
        },
        "5-5_ngram_token": {
          "type": "nGram",
          "min_gram": "5",
          "max_gram": "5"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "keyword"
      },
      "name_trans": {
        "type": "text",
        "fields": {
          "2-4partial": {
            "type": "text",
            "analyzer": "2-4nGrams",
            "search_analyzer": "keyword"
          },
          "5-5partial": {
            "type": "text",
            "analyzer": "5-5nGrams"
          }
        }
      }
    }
  }
}

and the two kinds of requests to use with this mapping, depending on the search term length:

GET name_test/_search
{
  "query": {
    "match": {
      "name_trans.2-4partial": {
        "query": "ema",
        "operator": "and",
        "fuzziness": 0
      }
    }
  }
}

GET name_test/_search
{
  "query": {
    "match": {
      "name_trans.5-5partial": {
        "query": "emanue",
        "operator": "and",
        "fuzziness": 0
      }
    }
  }
}

Maybe this will help someone someday :)

cylon86

1 Answer


I am not sure whether it's possible in Elasticsearch or not, but I can suggest a workaround which we also use in our application, although our use case was different.

  1. Create a custom analyzer using a 2-5 ngram tokenizer on the fields you want to use for partial search. This stores the ngram tokens of the fields in the inverted index; for example, for a field containing foobar as a value, it stores fo, foo, foob, fooba, oo, oob, ooba, oobar, ob, oba, obar, ba, bar, ar.

  2. Now, instead of a match query, use the term query on the partial fields, which is not analyzed; you can read about the difference between these here. (A minimal query example follows this list.)

  3. So now, in this case, it doesn't matter whether the search term is shorter than 5 chars or not; it will still match the tokens and you will get the results.
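For illustration, a term query against such a partial field could look like this (the index name is an assumption, and the field name is borrowed from the question's mapping; adjust to your own):

GET name_test/_search
{
  "query": {
    "term": {
      "name_trans.partial": "fooba"
    }
  }
}

Because the term query is not analyzed, the value is matched verbatim against the stored ngram tokens.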

Now let's dry-run this on a field containing foobar as a value and test it against some search terms.

Case 1: the search term contains fewer than 5 chars, like fo, oo, ar, bar, oob, oba and ooba. It will still match, as these tokens are present in the inverted index.

Case 2: the search term contains 5 or more chars, like fooba or oobar. It also returns the document, as the index contains these tokens.
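You can verify the token list from step 1 with the _analyze API. Note that on Elasticsearch 7+ a min/max gap of 3 requires index.max_ngram_diff >= 3 (the default is 1), so run this against an index configured accordingly, such as the name_test index above:

GET name_test/_analyze
{
  "tokenizer": { "type": "ngram", "min_gram": 2, "max_gram": 5 },
  "text": "foobar"
}

This returns exactly the 14 tokens listed in step 1.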

Let me know if it's clear or if you require additional clarification.

Amit
  • Hello, thanks for your help. However, using the term query won't work if the search term is more than 5 chars. Terms of len > 5 won't match in the inverted index, as they won't be tokenized into 5-gram tokens. For instance, if I index "fooAndBar" and search for "fooAnd" with a term query, I won't get any result. Am I right? – cylon86 Jun 11 '19 at 08:31
  • Then in this case, you need to configure the ngram tokenizer with a max ngram according to your application requirements. Let's suppose you set it to 30, which we used in our application; then you can search for terms up to `30` chars, which should be good enough for partial search. Hope I am clear, and if not, let me know and I can explain further – Amit Jun 11 '19 at 14:15
  • Thanks for this clarification. Indeed, we were also considering using the max ngram. But in our case the max is 1000 chars, which is a lot! We fear the huge inverted index could cause trouble in other areas (index maintenance, fuzzy search, etc.). So we were looking for an alternative with shorter ngrams. I will update my post with a solution that can work. – cylon86 Jun 12 '19 at 08:33
  • @cylon86 were you able to figure out the solution? – Amit Nov 27 '19 at 03:41
  • @AmitKhandelwal Yes, a workaround solution. I updated my post above with the workaround solution I used (solution 2). – cylon86 Nov 27 '19 at 10:29