You could try installing a plugin developed by the Wikimedia Foundation called Experimental Highlighter (available on GitHub).
You can install it for Elasticsearch 7.5 as follows; for other Elasticsearch versions, please refer to the GitHub project page:
./bin/elasticsearch-plugin install org.wikimedia.search.highlighter:experimental-highlighter-elasticsearch-plugin:7.5.1
Then restart Elasticsearch.
If offsets alone are enough for your use case, skip to the next paragraph. If you also need to retrieve the positions, declare your field with the term_vector option set to "with_positions_offsets_payloads" (see the Elasticsearch term_vector documentation):
PUT /my-index-000001
{
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "term_vector": "with_positions_offsets_payloads",
        "analyzer": "fulltext_analyzer"
      }
    }
  }
}
If you don't need the positions, it is faster and uses much less space to set index_options to "offsets" instead (see the Elasticsearch documentation and the plugin documentation):
PUT /my-index-000001
{
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "index_options": "offsets",
        "analyzer": "fulltext_analyzer"
      }
    }
  }
}
You can then query with the experimental highlighter and have it return only the offsets of the highlighted parts:
GET /my-index-000001/_search
{
  "query": {
    "match": {
      "text": "hello world"
    }
  },
  "highlight": {
    "order": "score",
    "fields": {
      "text": {
        "number_of_fragments": 10,
        "fragment_size": 15,
        "type": "experimental",
        "options": {"return_offsets": true}
      }
    }
  }
}
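If you send the request from a script rather than from a console, a minimal Python sketch of this search could look like the following. The host http://localhost:9200, the default index name and the helper names build_highlight_query and search are assumptions made for illustration; only the standard library is used.

```python
import json
import urllib.request


def build_highlight_query(field, text, fragments=10, fragment_size=15):
    """Build a match query that asks the experimental highlighter for offsets."""
    return {
        "query": {"match": {field: text}},
        "highlight": {
            "order": "score",
            "fields": {
                field: {
                    "number_of_fragments": fragments,
                    "fragment_size": fragment_size,
                    "type": "experimental",
                    # Option name as spelled in the plugin documentation
                    # quoted in this answer.
                    "options": {"return_offsets": True},
                }
            },
        },
    }


def search(body, host="http://localhost:9200", index="my-index-000001"):
    """POST the body to the _search endpoint and return the decoded JSON.

    The host and index defaults are assumptions; adjust them to your cluster.
    """
    request = urllib.request.Request(
        f"{host}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling search(build_highlight_query("text", "hello world")) against a running cluster should return the usual search response, with the offsets in place of the highlighted snippets.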
This way the query returns no highlighted text, only the start and end offsets, which are numbers representing character positions in the field. To retrieve the highlighted content, read ['hits']['hits'][0]['_source']['text'] (text is your field name) and slice the field value using the start and end offsets. Make sure you use the correct string encoding (UTF-8), otherwise the offsets will not match the text. According to the plugin documentation:
The return_offsets option changes the results from a highlighted
string to the offsets in the highlighted that would have been
highlighted. This is useful if you need to do client side sanity
checking on the highlighting. Instead of a marked up snippet you'll
get a result like 0:0-5,18-22:22. The outer numbers are the start and
end offset of the snippet. The pairs of numbers separated by the ,s
are the hits. The number before the - is the start offset and the
number after the - is the end offset. Multi-valued fields have a
single character worth of offset between them.
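On the client side, the offset string described above can be decoded with a few lines of Python. This is only a sketch under some assumptions: the helper names are made up, end offsets are treated as exclusive (so they can be used directly in a slice), and for a multi-valued field the values are joined with a single space to reproduce the one-character gap mentioned in the documentation.

```python
def parse_offsets(encoded):
    """Parse a string like '0:0-5,18-22:22' into (snippet_start, hits, snippet_end)."""
    start, hits_part, end = encoded.split(":")
    hits = []
    for pair in hits_part.split(","):
        if pair:  # tolerate a snippet that contains no hits
            hit_start, hit_end = pair.split("-")
            hits.append((int(hit_start), int(hit_end)))
    return int(start), hits, int(end)


def extract_hits(source_text, encoded):
    """Return the highlighted substrings of source_text for one offset string."""
    _, hits, _ = parse_offsets(encoded)
    # Assumption: end offsets are exclusive, as in a Python slice.
    return [source_text[hit_start:hit_end] for hit_start, hit_end in hits]


def join_values(values):
    """Join a multi-valued field so consecutive values are one character apart."""
    return " ".join(values)


text = "hello there, brave new world"
print(extract_hits(text, "0:0-5,23-28:28"))  # ['hello', 'world']
```

In a real response, source_text would come from ['hits']['hits'][0]['_source']['text'] and the offset string from the highlight section of the same hit.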
Let me know if that plugin could help!