User-provided term vectors for highlighting in Elasticsearch

Question

I want to use Elasticsearch's highlighting features in search results, but I can't use an analyzer plugin. Our (very custom) NLP pipeline is fairly heavy (in CPU and memory, and in production it may talk to other services for e.g. dictionary resolution).

Currently we turn a plain text document into a list of tokens, so "The quick siberian fox jumps over the grizzly bear" becomes {"text": "The quick siberian fox jumps over the grizzly bear", "tokens": ["quick", "siberian fox", "jump", "grizzly bear"]}. We then index that as a document with two fields, text and tokens, and do most of our searching as exact matches on the tokens field. So far so good.
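For concreteness, here is a minimal sketch of the document and the kind of exact-match query described above, written as plain Python dicts. The field names text and tokens come from our setup; the mapping is an assumption (a keyword field for tokens, in the typeless mapping format of Elasticsearch 7+), and matches_locally is a hypothetical helper that only mimics the term query for illustration.

```python
import json

# The document we index: original text plus tokens from our NLP pipeline.
doc = {
    "text": "The quick siberian fox jumps over the grizzly bear",
    "tokens": ["quick", "siberian fox", "jump", "grizzly bear"],
}

# Assumed mapping: "tokens" as keyword so values are matched exactly,
# without any Elasticsearch-side analysis.
mapping = {
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "tokens": {"type": "keyword"},
        }
    }
}

# Exact-match search on the tokens field.
query = {"query": {"term": {"tokens": "jump"}}}


def matches_locally(document, term_query):
    """Hypothetical stand-in for the term query, for illustration only."""
    (field, value), = term_query["query"]["term"].items()
    values = document.get(field, [])
    if not isinstance(values, list):
        values = [values]
    return value in values


if __name__ == "__main__":
    print(json.dumps(query))
    print(matches_locally(doc, query))
```

Multi-word tokens such as "siberian fox" are stored as single keyword values here, which is why a standard analyzer on the text field would not reproduce them.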

Now we are considering highlighting matches in the original text, so if a user searches for "jump" we want to return The quick siberian fox [jumps] over the grizzly bear. However, as far as I can tell, the Elasticsearch highlighting engine depends on analyzing the plain text either at indexing time or at query time, to obtain term vectors, which contain position info. (Is this correct?)
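To make the target behaviour concrete, here is a small sketch of the output we are after. It is not Elasticsearch highlighting, just a local illustration. It assumes our pipeline emits (start, end, token) character offsets for each token; that tuple layout and the highlight function are hypothetical names used only for this example.

```python
text = "The quick siberian fox jumps over the grizzly bear"

# Assumed pipeline output: (start offset, end offset, normalized token).
spans = [
    (4, 9, "quick"),
    (10, 22, "siberian fox"),
    (23, 28, "jump"),
    (38, 50, "grizzly bear"),
]


def highlight(text, spans, query_token, pre="[", post="]"):
    """Wrap the surface form of every span whose token equals query_token."""
    out = []
    cursor = 0
    for start, end, token in sorted(spans):
        if token != query_token:
            continue
        out.append(text[cursor:start])          # untouched text before the hit
        out.append(pre + text[start:end] + post)  # highlighted surface form
        cursor = end
    out.append(text[cursor:])                   # remainder after the last hit
    return "".join(out)


if __name__ == "__main__":
    print(highlight(text, spans, "jump"))
```

Note that the match is on the normalized token ("jump") while the highlighted text is the surface form ("jumps"), which is exactly the position information we would like Elasticsearch to use.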

Because we can't write an analyzer plugin for ES, we can't rely on this method. However, we do produce position info when running the NLP pipeline on a plain text string, so can we provide term vectors ourselves at indexing time? I've found the question "User defined termvectors in ElasticSearch", but its single answer focuses on the application (KNN) rather than on the problem of inserting term vectors manually.

Alternatively, is there a different way of doing highlighting that we can use? I've found https://www.elastic.co/blog/search-for-things-not-strings-with-the-annotated-text-plugin but I'm not sure how it would behave if we just indexed stuff like the [quick](quick) [siberian fox](siberian fox) [jumps](jump) over the [grizzly bear](grizzly bear) where almost everything would be annotated.
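If we went the annotated-text route, the field value could be generated from our pipeline output roughly as follows. This is a sketch under assumptions: the (start, end, token) offsets are a hypothetical shape for our pipeline output, and I am assuming the plugin's [text](value) syntax expects URL-encoded annotation values, as I understand it from the blog post, so "siberian fox" would become siberian+fox. The sketch does not handle overlapping spans or square brackets in the source text.

```python
from urllib.parse import quote_plus

text = "The quick siberian fox jumps over the grizzly bear"

# Assumed pipeline output: (start offset, end offset, normalized token).
spans = [
    (4, 9, "quick"),
    (10, 22, "siberian fox"),
    (23, 28, "jump"),
    (38, 50, "grizzly bear"),
]


def to_annotated_text(text, spans):
    """Build a [surface](token) string from non-overlapping offset spans."""
    out = []
    cursor = 0
    for start, end, token in sorted(spans):
        if start < cursor:
            raise ValueError("overlapping spans are not supported in this sketch")
        out.append(text[cursor:start])  # plain text between annotations
        # Annotation value is URL-encoded (assumption about the plugin syntax).
        out.append("[{}]({})".format(text[start:end], quote_plus(token)))
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


annotated = to_annotated_text(text, spans)

if __name__ == "__main__":
    print(annotated)
```

With our pipeline, nearly every content word ends up wrapped in an annotation, which is the case I am unsure about.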

You should use highlight for your purposes. Term vectors are a different thing. Why would you tokenize that way? It makes no sense and only creates problems. - Lupanoide, Jan 07 '19 at 13:17
@Lupanoide The tokenization is completely custom, as I said before. We have our own dictionary and use opennlp/stanford-nlp with custom-trained models, so we really can't use Elasticsearch's built-in analyzers. Is highlighting really not related to term vectors? I may have misunderstood the documentation. - Fabrice Gabolde, Jan 08 '19 at 10:55
