TL;DR
Is it possible to have Elasticsearch return the matched input shingle alongside the matched document in a fuzzy query?
Example:
Let's say I have a shingle filter:
"fulltext_shingle_filter": {
  "type": "shingle",
  "min_shingle_size": 2,
  "max_shingle_size": 3,
  "output_unigrams": false
}
That shingle filter is used in a custom search analyzer:
"fulltext_shingle": {
  "type": "custom",
  "tokenizer": "standard",
  "filter": ["fulltext_shingle_filter"]
}
At index time, the field is analyzed as whole keywords, like so:
"whitelist_keyword": {
  "type": "custom",
  "tokenizer": "keyword"
}
with documents looking something like this:
{
  "_source": {
    "names": [
      "John Smith",
      "Smith, John"
    ]
  }
},
{
  "_source": {
    "names": [
      "Mr Wayne"
    ]
  }
}
And we query like this:
POST /someindex/_search
{
  "query": {
    "match": {
      "names": {
        "query": "Hi, my Name is John Smit, I like toast.",
        "analyzer": "fulltext_shingle",
        "fuzziness": 1
      }
    }
  }
}
This splits the query text using our fulltext_shingle analyzer and applies a fuzziness of 1 to each resulting shingle, among others "John Smit". Elasticsearch then returns the document containing "John Smith", because the Levenshtein distance between the two is 1.
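To make the shingling step concrete, here is a rough Python approximation of what I expect the fulltext_shingle analyzer to produce for that query string. It is only a sketch: the `shingles` helper is my own name, and the `\w+` regex is an assumed stand-in for the standard tokenizer, not what Elasticsearch actually runs.

```python
import re

def shingles(text, min_size=2, max_size=3):
    """Join consecutive word tokens into shingles of min_size..max_size words."""
    # Assumption: \w+ only roughly imitates the "standard" tokenizer.
    tokens = re.findall(r"\w+", text)
    out = []
    for start in range(len(tokens)):
        for size in range(min_size, max_size + 1):
            if start + size <= len(tokens):
                out.append(" ".join(tokens[start:start + size]))
    return out

# No unigrams are emitted, mirroring "output_unigrams": false.
query_shingles = shingles("Hi, my Name is John Smit, I like toast.")
print(query_shingles)
```

The shingle I care about, "John Smit", is one of the entries in that list; the rest ("Hi my", "my Name is", and so on) are just noise as far as the whitelist is concerned.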
Now, is it possible to have Elasticsearch return the input shingle as it was before the fuzzing, i.e. "John Smit", alongside the matched document?
The only thing I could think of was to essentially reverse the query, i.e. index the query document with shingles enabled and then query for the original output ("John Smith") with the same fuzziness. But that seems like a terribly error-prone and resource-wasting hassle to me.
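For illustration, the reverse lookup I have in mind boils down to something like the following Python sketch. It runs entirely client-side rather than in Elasticsearch, the helper names and the hand-picked `candidates` list are hypothetical, and it uses plain Levenshtein distance, whereas Elasticsearch's fuzziness by default also counts a transposition as a single edit.

```python
def levenshtein(a, b):
    """Edit distance with insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def matching_shingles(matched_name, candidate_shingles, max_distance=1):
    """Return the input shingles within max_distance edits of the matched name."""
    return [s for s in candidate_shingles
            if levenshtein(s, matched_name) <= max_distance]

# Hypothetical subset of the shingles generated from the query text.
candidates = ["is John", "is John Smit", "John Smit", "John Smit I", "Smit I"]
print(matching_shingles("John Smith", candidates))
```

This has to be repeated for every name in every returned document, which is exactly the kind of duplicated work I would like to avoid.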