I want to use Elasticsearch's highlighting features in search results, but I can't use an analyzer plugin. Our (very custom) NLP pipeline is fairly heavy in both CPU and memory, and in production it may call other services, e.g. for dictionary resolution.
Currently we turn a plain text document into a list of tokens, so "The quick siberian fox jumps over the grizzly bear" becomes {"text": "The quick siberian fox jumps over the grizzly bear", "tokens": ["quick", "siberian fox", "jump", "grizzly bear"]}. Then we just insert the above as a document, which contains two fields, "text" and "tokens", and we do most of our search as exact matches on the "tokens" field. So far so good.
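For concreteness, here is a minimal Python sketch of what we index and how we query it. `build_document` and `exact_match_query` are hypothetical helper names, not part of any client library, and the sketch assumes "tokens" is mapped as a `keyword` field so that a `term` query matches whole tokens such as "siberian fox".

```python
def build_document(text, tokens):
    """Return the document body we index: the raw text plus pipeline tokens."""
    return {"text": text, "tokens": list(tokens)}


def exact_match_query(token):
    """Return a search body doing an exact match on the `tokens` field.

    Assumes `tokens` is a keyword field (an assumption about our mapping).
    """
    return {"query": {"term": {"tokens": token}}}


doc = build_document(
    "The quick siberian fox jumps over the grizzly bear",
    ["quick", "siberian fox", "jump", "grizzly bear"],
)
query = exact_match_query("jump")
```

Both values are plain dicts, so they can be passed as the body of an index or search request by whichever client is in use.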
Now we are considering highlighting matches in the original text, so if a user searches for "jump" we want to return: The quick siberian fox [jumps] over the grizzly bear. However, as far as I can tell, the Elasticsearch highlighting engine depends on analyzing the plain text, either at indexing time or at query time, to obtain term vectors, which contain position info. (Is this correct?)
Because we can't write an analyzer plugin for ES, we can't rely on this method. However, we do produce position info when running the NLP pipeline on a plain text string, so can we provide term vectors ourselves at indexing time? I've found "User defined termvectors in ElasticSearch", but the single answer focuses on the application (KNN) instead of the problem of inserting term vectors manually.
Alternatively, is there a different way of doing highlighting that we can use? I've found https://www.elastic.co/blog/search-for-things-not-strings-with-the-annotated-text-plugin, but I'm not sure how it would behave if we just indexed stuff like: the [quick](quick) [siberian fox](siberian fox) [jumps](jump) over the [grizzly bear](grizzly bear), where almost everything would be annotated.
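To make that last idea concrete, this is roughly how we would generate the markup from our pipeline output. It is only a sketch: `to_annotated_text` and the `(start, end, token)` span tuples are hypothetical, it assumes the pipeline gives us non-overlapping character offsets, and it URL-encodes the annotation values because that is my understanding of what the annotated-text plugin expects. It does not handle square brackets already present in the text.

```python
from urllib.parse import quote_plus


def to_annotated_text(text, spans):
    """Wrap each (start, end, token) span of `text` as [surface](token).

    `spans` holds character offsets (end exclusive) and must not overlap.
    Annotation values are URL-encoded (assumed plugin requirement).
    """
    pieces = []
    cursor = 0
    for start, end, token in sorted(spans):
        if start < cursor:
            raise ValueError("overlapping spans are not supported in this sketch")
        pieces.append(text[cursor:start])  # untouched text before the span
        pieces.append(f"[{text[start:end]}]({quote_plus(token)})")
        cursor = end
    pieces.append(text[cursor:])  # trailing text after the last span
    return "".join(pieces)


text = "The quick siberian fox jumps over the grizzly bear"
spans = [
    (4, 9, "quick"),
    (10, 22, "siberian fox"),
    (23, 28, "jump"),
    (38, 50, "grizzly bear"),
]
annotated = to_annotated_text(text, spans)
# annotated == "The [quick](quick) [siberian fox](siberian+fox) [jumps](jump) over the [grizzly bear](grizzly+bear)"
```

The resulting string is what would go into the annotated field in place of the plain "text" value.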