
We index a lot of documents that may contain titles like "lightbulb 220V", "Box 23cm" or "Varta Super-charge battery 74Ah". However, our users tend to separate the number and the unit with whitespace when searching, so when they search for "Varta 74 Ah" they do not get what they expect. The above is a simplification of the problem, but the main question is hopefully valid: how can I analyze "Varta Super-charge battery 74Ah" so that (on top of the other tokens) 74, Ah and 74Ah are created?

Thanks,

Michal

Michal Holub

3 Answers


I guess this will help you:

PUT index_name
{
  "settings": {
    "analysis": {
      "filter": {
        "custom_filter": {
          "type": "word_delimiter",
          "split_on_numerics": true
        }
      },
      "analyzer": {
        "custom_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["custom_filter"]
        }
      }
    }
  }
}

You can use the split_on_numerics property in your custom word_delimiter filter. Analyzing your example text then gives this response:

Request

POST /index_name/_analyze
{
  "analyzer": "custom_analyzer",
  "text": "Varta Super-charge battery 74Ah"
}

Response

{
  "tokens" : [
    {
      "token" : "Varta",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "Super",
      "start_offset" : 6,
      "end_offset" : 11,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "charge",
      "start_offset" : 12,
      "end_offset" : 18,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "battery",
      "start_offset" : 19,
      "end_offset" : 26,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "74",
      "start_offset" : 27,
      "end_offset" : 29,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "Ah",
      "start_offset" : 29,
      "end_offset" : 31,
      "type" : "word",
      "position" : 5
    }
  ]
}
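
To actually search with this, the analyzer has to be attached to a field in the index mapping. Below is a minimal sketch, assuming a text field named product (the field name is mine, it is not part of the request above); the whitespace-separated query should then match, because the query string is split by the same analyzer at search time:

PUT index_name/_mapping
{
  "properties": {
    "product": {                        <--- hypothetical field name
      "type": "text",
      "analyzer": "custom_analyzer"
    }
  }
}

PUT index_name/_doc/1
{
  "product": "Varta Super-charge battery 74Ah"
}

POST index_name/_search
{
  "query": {
    "match": {
      "product": "Varta 74 Ah"          <--- number and unit separated by whitespace
    }
  }
}
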
Harshit
  • That's an interesting option; of the options here I think it's the most viable one, and I'll definitely look into it. – Michal Holub Mar 10 '20 at 09:36
  • @MichalHolub Sure. I hope it'll help you out. Let me know if you still face any issues. Don't forget to upvote if you liked my answer :) – Harshit Mar 10 '20 at 12:57

You would need to create a custom analyzer which implements the ngram tokenizer and then apply it to the text field you create.

Below are the sample mapping, documents, query and response:

Mapping:

PUT my_split_index
{
  "settings": {
    "index":{
      "max_ngram_diff": 3
    },
    "analysis": {
      "analyzer": { 
        "my_analyzer": {                     <---- Custom Analyzer
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 5,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "product":{
        "type": "text",
        "analyzer": "my_analyzer",       <--- Note this as how custom analyzer is applied on this field
        "fields": {
          "keyword":{
            "type": "keyword"
          }
        }
      }
    }
  }
}

The feature that you are looking for is called ngram, which creates multiple tokens from a single token. The size of the tokens depends on the min_gram and max_gram settings as mentioned above.

Note that I've set max_ngram_diff to 3, because in version 7.x ES's default value is 1. Looking at your use-case I've set it to 3; this value is nothing but max_gram - min_gram.

Sample Documents:

POST my_split_index/_doc/1
{
  "product": "Varta 74 Ah"
}

POST my_split_index/_doc/2
{
  "product": "lightbulb 220V"
}

Query Request:

POST my_split_index/_search
{
  "query": {
    "match": {
      "product": "74Ah"
    }
  }
}

Response:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.7029606,
    "hits" : [
      {
        "_index" : "my_split_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.7029606,
        "_source" : {
          "product" : "Varta 74 Ah"
        }
      }
    ]
  }
}

Additional Info:

To understand what tokens are actually generated, you can make use of the Analyze API below:

POST my_split_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Varta 74 Ah"
}

You can see that the below tokens got generated when I executed the above API:

{
  "tokens" : [
    {
      "token" : "Va",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "Var",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "Vart",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "Varta",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "ar",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "art",
      "start_offset" : 1,
      "end_offset" : 4,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "arta",
      "start_offset" : 1,
      "end_offset" : 5,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "rt",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "rta",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "ta",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "word",
      "position" : 9
    },
    {
      "token" : "74",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "word",
      "position" : 10
    },
    {
      "token" : "Ah",
      "start_offset" : 9,
      "end_offset" : 11,
      "type" : "word",
      "position" : 11
    }
  ]
}

Notice that the query I've mentioned in the Query Request section is 74Ah, yet it still returns the document. That is because ES applies the analyzer twice: at index time and at search time. By default, if you do not specify a search_analyzer, the same analyzer you applied at indexing time also gets applied at query time.
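
If you wanted the ngrams only at index time, the field can also be given a separate search_analyzer. Below is a minimal sketch, assuming a new index (the name my_split_index_v2 is mine, purely for illustration) and the built-in standard analyzer at query time; note that this would change the matching behaviour described above, since the query terms would no longer be split into ngrams:

PUT my_split_index_v2
{
  "settings": {
    "index": {
      "max_ngram_diff": 3
    },
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 5,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "product": {
        "type": "text",
        "analyzer": "my_analyzer",            <--- ngrams at index time
        "search_analyzer": "standard"         <--- plain tokens at query time (illustration)
      }
    }
  }
}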

Hope this helps!

Kamal Kunjapur
  • But this would not work if the number has more digits, like `123456789`; to make it work you need to increase the diff b/w min and max gram, which would increase the index size a lot –  Mar 09 '20 at 15:49
  • Yup, the value for `ngram` should be carefully thought through while using it, otherwise it can lead to huge disk usage without necessarily getting much benefit. However, the above solution works for the number you've mentioned. – Kamal Kunjapur Mar 09 '20 at 16:06
  • How can it work with a gram diff of just `3`? I just used your example to create the index and tested it with the text `Box 123456789az`; please see there is no `123456789` token generated, this is what I meant, let me know if I am missing anything –  Mar 09 '20 at 16:33
  • @es-enthu I know what you are trying to say, but the point is that even during search time the ngram tokens get created. Which means if you use a simple match query with the value `123456789`, it would in fact search with the n-gram tokens. That is the reason there would be a match and that the document with `Box 123456789az` would be returned. – Kamal Kunjapur Mar 09 '20 at 16:52
  • but isn't common to define a search-time analyzer, which doesn't split the input search terms into `n-grams`? –  Mar 09 '20 at 16:56
  • As mentioned in the link, `By default, queries will use the analyzer defined in the field mapping, but this can be overridden with the search_analyzer setting:` https://www.elastic.co/guide/en/elasticsearch/reference/current/search-analyzer.html – Kamal Kunjapur Mar 09 '20 at 16:59
  • yeah, I know about it, I said `isn't it common to have`, and again passing it through the same analyzer will cause a hell of a lot of search results, which would hurt performance as well as relevance –  Mar 09 '20 at 17:11
  • `but isn't common to define` You didn't mention `it` so I assumed you meant `it isn't` :) Sorry. I mean yes, relevancy would be a pain for this use case, but I guess people can fine-tune it using multiple words, or one has to go through some proper use cases and testing before going ahead with the solution for a better match. I would let the OP decide if this solution fits his use-case, as there are two more solutions that might fit his requirement. But I guess I didn't focus on number-unit in the question but rather on generic sub-string matching. – Kamal Kunjapur Mar 09 '20 at 17:25
  • thanks, it's nice interacting with you, I am also learning ES and would like to know: in your opinion, which solution works best? –  Mar 09 '20 at 17:44
  • @es-enthu Happy to help anytime, and it's good to see your queries too!! Keep learning (like me), you are in the right direction. With regards to the best solution, I would probably let the OP decide that, as he would have to understand his business and see what best fits him. All 3 solutions are good for his use-case. Having a good search solution takes a lot of effort and iterations...it's more like an evolutionary process. Hope that helps! – Kamal Kunjapur Mar 09 '20 at 18:16
  • Sorry if I confused you with my English; what I meant is whether you have any other solution which would work best in these cases. I see a lot of similar questions and am in general curious about the best solutions, but I understand your last point, which kind of explains this –  Mar 09 '20 at 19:10
  • Thanks for the comments. I'm afraid that this won't be the way to go. As others suggested, the number may be 3-5 chars long and the unit can be 1 character long (Volt, Ampere...etc); ngrams with such wide margins would be useless. – Michal Holub Mar 10 '20 at 09:34

You can define your index mapping as below and see that it generates the tokens you mentioned in your question. Also, it doesn't create a lot of tokens, hence the size of your index would be smaller.

Index mapping

{
    "settings": {
        "analysis": {
            "filter": {
                "my_filter": {
                    "type": "word_delimiter",
                    "split_on_numerics": "true",
                    "catenate_words": "true",
                    "preserve_original": "true"
                }
            },
            "analyzer": {
                "my_analyzer": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": [
                        "my_filter",
                        "lowercase"
                    ]
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "analyzer": "my_analyzer"
            }
        }
    }
}

And check the tokens generated using the _analyze API:

{
    "text": "Varta Super-charge battery 74Ah",
    "analyzer": "my_analyzer"
}

Tokens generated

{
    "tokens": [
        {
            "token": "varta",
            "start_offset": 0,
            "end_offset": 5,
            "type": "word",
            "position": 0
        },
        {
            "token": "super-charge",
            "start_offset": 6,
            "end_offset": 18,
            "type": "word",
            "position": 1
        },
        {
            "token": "super",
            "start_offset": 6,
            "end_offset": 11,
            "type": "word",
            "position": 1
        },
        {
            "token": "supercharge",
            "start_offset": 6,
            "end_offset": 18,
            "type": "word",
            "position": 1
        },
        {
            "token": "charge",
            "start_offset": 12,
            "end_offset": 18,
            "type": "word",
            "position": 2
        },
        {
            "token": "battery",
            "start_offset": 19,
            "end_offset": 26,
            "type": "word",
            "position": 3
        },
        {
            "token": "74ah",
            "start_offset": 27,
            "end_offset": 31,
            "type": "word",
            "position": 4
        },
        {
            "token": "74",
            "start_offset": 27,
            "end_offset": 29,
            "type": "word",
            "position": 4
        },
        {
            "token": "ah",
            "start_offset": 29,
            "end_offset": 31,
            "type": "word",
            "position": 5
        }
    ]
}

Edit: The tokens generated here and in the other answers might look the same at first glance, but I made sure that this analyzer satisfies all the requirements given in the question; on close inspection the tokens generated are quite different, details of which are below:

  1. The tokens generated are all lowercase, to provide the case-insensitive search functionality that is implicit in all search engines.
  2. The critical thing to note is the tokens generated as 74ah and supercharge; these are mentioned in the question, and my analyzer provides them as well, as the search sketch after this list shows.
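
For a quick check against the whitespace-separated search from the question, a plain match query should find the document. Below is a minimal sketch, assuming the mapping above was created as an index named my_index (the index name is mine, the answer does not specify one):

PUT my_index/_doc/1
{
  "title": "Varta Super-charge battery 74Ah"
}

POST my_index/_search
{
  "query": {
    "match": {
      "title": "Varta 74 Ah"            <--- number and unit separated by whitespace
    }
  }
}
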
Amit
  • Like the answer above from @Harshit, thanks for your suggestion; it looks like it may work, I will need to test it with my strings – Michal Holub Mar 10 '20 at 09:37
  • @MichalHolub, hope you are fine; it would be great if you could provide further updates, as the last update was almost a week ago, and I am curious whether it solved your issue or not – Amit Mar 15 '20 at 11:47