You would need to create a Custom Analyzer which implements the Ngram Tokenizer and then apply it to the text field you create.
Below are the sample mapping, documents, query and response:
Mapping:
PUT my_split_index
{
"settings": {
"index":{
"max_ngram_diff": 3
},
"analysis": {
"analyzer": {
"my_analyzer": { <---- Custom Analyzer
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 2,
"max_gram": 5,
"token_chars": [
"letter",
"digit"
]
}
}
}
},
"mappings": {
"properties": {
"product":{
"type": "text",
"analyzer": "my_analyzer", <--- Note this as how custom analyzer is applied on this field
"fields": {
"keyword":{
"type": "keyword"
}
}
}
}
}
}
The feature that you are looking for is called Ngram, which creates multiple tokens from a single token. The size of those tokens depends on the min_gram and max_gram settings mentioned above.
Note that I've set max_ngram_diff to 3, because in version 7.x ES's default value is 1. This value is nothing but max_gram - min_gram, which for your use-case (min_gram: 2, max_gram: 5) is 3.
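If I remember correctly, index.max_ngram_diff is a dynamic index setting, so on an existing index you should be able to raise it without reindexing via the settings API (worth double-checking for your exact ES version). The analyzer definition itself (min_gram/max_gram) is part of the static analysis settings, so changing those still requires recreating or closing the index:
PUT my_split_index/_settings
{
  "index": {
    "max_ngram_diff": 3
  }
}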
Sample Documents:
POST my_split_index/_doc/1
{
"product": "Varta 74 Ah"
}
POST my_split_index/_doc/2
{
"product": "lightbulb 220V"
}
Query Request:
POST my_split_index/_search
{
"query": {
"match": {
"product": "74Ah"
}
}
}
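As a side note, since the mapping above also defines a keyword sub-field (product.keyword), you can still run exact, untokenized matches when you need them. A minimal sketch using a term query on that sub-field:
POST my_split_index/_search
{
  "query": {
    "term": {
      "product.keyword": "Varta 74 Ah"
    }
  }
}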
Response:
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.7029606,
"hits" : [
{
"_index" : "my_split_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 1.7029606,
"_source" : {
"product" : "Varta 74 Ah"
}
}
]
}
}
Additional Info:
To understand which tokens are actually generated, you can use the Analyze API as shown below:
POST my_split_index/_analyze
{
"analyzer": "my_analyzer",
"text": "Varta 74 Ah"
}
You can see that the tokens below were generated when I executed the above API:
{
"tokens" : [
{
"token" : "Va",
"start_offset" : 0,
"end_offset" : 2,
"type" : "word",
"position" : 0
},
{
"token" : "Var",
"start_offset" : 0,
"end_offset" : 3,
"type" : "word",
"position" : 1
},
{
"token" : "Vart",
"start_offset" : 0,
"end_offset" : 4,
"type" : "word",
"position" : 2
},
{
"token" : "Varta",
"start_offset" : 0,
"end_offset" : 5,
"type" : "word",
"position" : 3
},
{
"token" : "ar",
"start_offset" : 1,
"end_offset" : 3,
"type" : "word",
"position" : 4
},
{
"token" : "art",
"start_offset" : 1,
"end_offset" : 4,
"type" : "word",
"position" : 5
},
{
"token" : "arta",
"start_offset" : 1,
"end_offset" : 5,
"type" : "word",
"position" : 6
},
{
"token" : "rt",
"start_offset" : 2,
"end_offset" : 4,
"type" : "word",
"position" : 7
},
{
"token" : "rta",
"start_offset" : 2,
"end_offset" : 5,
"type" : "word",
"position" : 8
},
{
"token" : "ta",
"start_offset" : 3,
"end_offset" : 5,
"type" : "word",
"position" : 9
},
{
"token" : "74",
"start_offset" : 6,
"end_offset" : 8,
"type" : "word",
"position" : 10
},
{
"token" : "Ah",
"start_offset" : 9,
"end_offset" : 11,
"type" : "word",
"position" : 11
}
]
}
Notice that the query I've mentioned in the Query Request section is 74Ah, yet it still returns the document. That is because ES applies the analyzer twice: at index time and at search time. By default, if you do not specify a search_analyzer, the same analyzer you applied at index time is also applied at query time.
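If you want to see this for yourself, you can run the same Analyze API against the query string; with the settings above it should produce tokens such as 74, 74A, 74Ah, 4A, 4Ah and Ah, and the 74 and Ah tokens are what match the indexed document:
POST my_split_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "74Ah"
}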
Hope this helps!