I'm building an analyzer to provide partial search on terms. I want to use a 2-5 ngram tokenizer at index time and a 5-5 ngram tokenizer at search time.
The rationale for using 2-5 ngrams at index time is that a partial term query as short as 2 characters should still match.
At search time, if the search term is shorter than 5 characters, it can be looked up directly in the inverted index. If it is 5 characters or longer, the term is tokenized into 5-grams and matches only if all tokens match.
However, in Elasticsearch, a 5-5 ngram tokenizer produces no tokens at all if the query term is shorter than 5 characters. A workaround could be to use the same 2-5 tokenizer at search time as at index time, but that would mean also searching all the 2-gram, 3-gram and 4-gram tokens, which is useless (the 5-gram tokens are sufficient).
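You can reproduce this behavior with the _analyze API and a transient tokenizer definition, for example:

POST _analyze
{
  "tokenizer": {
    "type": "ngram",
    "min_gram": 5,
    "max_gram": 5
  },
  "text": "ema"
}

This returns an empty token list, while "emanue" would yield the two 5-grams "emanu" and "manue".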
Here is my current index mapping (max_ngram_diff is needed on ES 7+, since min_gram and max_gram differ by more than 1):
{
  "settings": {
    "max_ngram_diff": 3,
    "analysis": {
      "analyzer": {
        "index_partial": {
          "type": "custom",
          "tokenizer": "2-5_ngram_token"
        },
        "search_partial": {
          "type": "custom",
          "tokenizer": "5-5_ngram_token"
        }
      },
      "tokenizer": {
        "2-5_ngram_token": {
          "type": "ngram",
          "min_gram": "2",
          "max_gram": "5"
        },
        "5-5_ngram_token": {
          "type": "ngram",
          "min_gram": "5",
          "max_gram": "5"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "keyword"
      },
      "name_trans": {
        "type": "text",
        "fields": {
          "partial": {
            "type": "text",
            "analyzer": "index_partial",
            "search_analyzer": "search_partial"
          }
        }
      }
    }
  }
}
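With this mapping, any query shorter than 5 characters matches nothing, because search_partial emits zero tokens and a match query with no tokens matches no documents by default (zero_terms_query defaults to none). For example (my_index is just a placeholder for the index name):

GET my_index/_search
{
  "query": {
    "match": {
      "name_trans.partial": "ema"
    }
  }
}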
So my question is: how can I create an analyzer that is a no-op if the search term is shorter than 5 characters, and that produces 5-gram tokens if it is 5 characters or longer?
----------------------UPDATE WITH WORKAROUND SOLUTION-----------------------
It seems it is not possible to create an analyzer that does a no-op if len < 5 and produces 5-5 ngrams if len >= 5.
There are two workaround solutions to perform partial search:
1- As mentioned by @Amit Khandelwal, one solution is to use maximal ngrams at index time. If your field holds at most 30 characters, use a tokenizer with ngrams 2-30 and, at search time, search for the exact term without running it through the ngram analyzer (either via a term query or by setting the search analyzer to keyword); a sketch of this mapping follows the list. The drawback of this solution is that it can produce a huge inverted index, depending on the maximum length: a single 30-character value alone yields 29 + 28 + ... + 1 = 435 ngrams.
2- The other solution is to create two fields:
- one for short search terms, which can be looked up directly in the inverted index without being tokenized
- one for longer search terms, which must be tokenized
Depending on the length of the search term, the search is performed on one or the other of these two fields.
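For reference, here is a minimal sketch of a solution-1 mapping. The index name, the analyzer/tokenizer names and the 30-character limit are illustrative, not taken from a tested setup:

PUT name_test_maxgram
{
  "settings": {
    "max_ngram_diff": 28,
    "analysis": {
      "analyzer": {
        "2-30nGrams": {
          "type": "custom",
          "tokenizer": "2-30_ngram_token",
          "filter": ["lowercase"]
        }
      },
      "tokenizer": {
        "2-30_ngram_token": {
          "type": "ngram",
          "min_gram": "2",
          "max_gram": "30"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name_trans": {
        "type": "text",
        "fields": {
          "partial": {
            "type": "text",
            "analyzer": "2-30nGrams",
            "search_analyzer": "keyword"
          }
        }
      }
    }
  }
}

Note that, as with solution 2 below, the query term should be lowercased client-side, since the keyword analyzer does not lowercase while the index analyzer does.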
Below is the mapping I used for solution 2 (the limit I chose between short and long terms is len = 5):
PUT name_test
{
  "settings": {
    "max_ngram_diff": 3,
    "analysis": {
      "analyzer": {
        "2-4nGrams": {
          "type": "custom",
          "tokenizer": "2-4_ngram_token",
          "filter": ["lowercase"]
        },
        "5-5nGrams": {
          "type": "custom",
          "tokenizer": "5-5_ngram_token",
          "filter": ["lowercase"]
        }
      },
      "tokenizer": {
        "2-4_ngram_token": {
          "type": "ngram",
          "min_gram": "2",
          "max_gram": "4"
        },
        "5-5_ngram_token": {
          "type": "ngram",
          "min_gram": "5",
          "max_gram": "5"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "keyword"
      },
      "name_trans": {
        "type": "text",
        "fields": {
          "2-4partial": {
            "type": "text",
            "analyzer": "2-4nGrams",
            "search_analyzer": "keyword"
          },
          "5-5partial": {
            "type": "text",
            "analyzer": "5-5nGrams"
          }
        }
      }
    }
  }
}
And here are the two kinds of requests to use with this mapping, depending on the search term's length:
GET name_test/_search
{
  "query": {
    "match": {
      "name_trans.2-4partial": {
        "query": "ema",
        "operator": "and",
        "fuzziness": 0
      }
    }
  }
}
GET name_test/_search
{
  "query": {
    "match": {
      "name_trans.5-5partial": {
        "query": "emanue",
        "operator": "and",
        "fuzziness": 0
      }
    }
  }
}
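As a sanity check, the _analyze API can be pointed at the index to see which tokens each analyzer emits, e.g.:

GET name_test/_analyze
{
  "analyzer": "2-4nGrams",
  "text": "Emanuel"
}

This returns every 2-, 3- and 4-character substring of "emanuel" (lowercased by the filter), while the same call with "5-5nGrams" returns only the three 5-grams "emanu", "manue" and "anuel".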
Maybe this will help someone someday :)