How to get exact page number and line number of the highlighted fields in elasticsearch for files using elasticsearch-mapper-attachments plugin

Question

I am using the elasticsearch-mapper-attachments plugin to extract data from files. Is there any way to get the exact page number and line number of the highlighted fields? My current mapping for the index is given below.

{
    "type_name" : {
          "content" : {"term_vector" : "with_positions_offsets"}
    }
}

According to [this open issue](https://github.com/elastic/elasticsearch-mapper-attachments/issues/135), I don't think it's currently possible. - Val, Jul 28 '15 at 13:42
Thank you @Val. After searching and going through a lot of documentation, I also think it's not currently possible. - Prashant N, Jul 30 '15 at 06:06

score 1 · Answer 1 · edited May 23 '17 at 11:54

I have dug into the Mapper Attachments plugin a little, and I find it very inflexible and slow. You're also mixing concerns (text extraction and indexing), which will make performance tuning more complex.

First: you will be better off installing Tika and extracting the text yourself. This will probably also improve performance, since you are no longer sending large base64-encoded BLOBs over HTTP to ES, and text extraction gets its own process and heap.

Second: extract the text page by page. See the question "Is it possible to extract text by page for word/pdf files using Apache Tika?"
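
A minimal sketch of the page-splitting step, assuming Tika is asked for XHTML output and that its PDF parser wraps each page in a <div class="page"> element. That is worth verifying for your Tika version and file types (Word files, for instance, have no fixed pages). The names PageSplitter and split_pages and the sample markup are illustrative, and fetching the XHTML from Tika is left out.

```python
from html.parser import HTMLParser


class PageSplitter(HTMLParser):
    """Collects the text found inside each <div class="page"> element."""

    def __init__(self):
        super().__init__()
        self.pages = []   # one list of text chunks per page
        self._depth = 0   # div nesting depth inside the current page (0 = outside)

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._depth > 0:
            # A nested div inside a page: track it so we find the right closing tag.
            self._depth += 1
        elif dict(attrs).get("class") == "page":
            self._depth = 1
            self.pages.append([])

    def handle_endtag(self, tag):
        if tag == "div" and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth > 0:
            self.pages[-1].append(data)


def split_pages(xhtml):
    """Return a list of page texts (index 0 = page 1), whitespace-normalized."""
    parser = PageSplitter()
    parser.feed(xhtml)
    parser.close()
    return [" ".join(" ".join(chunks).split()) for chunks in parser.pages]


# Assumed shape of the Tika XHTML output, shortened for illustration.
sample = (
    '<html><body>'
    '<div class="page"><p>here comes page 1</p></div>'
    '<div class="page"><p>here comes\n page 2</p></div>'
    '</body></html>'
)
pages = split_pages(sample)
```

The page number of an entry is its list index plus one, which is all the later indexing steps need.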

Third: possibly, index each page as a separate field (for example "pdf_page_1", "pdf_page_2", etc.). Highlights are returned per field, so the field name of each highlighted fragment should give you the page number of the hit.
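
A rough sketch of the field-per-page idea, assuming the page texts are already available as a list. The helper names (build_document, highlight_request, pages_from_hit) are made up for illustration, and the hit below is a hand-written stand-in for a real search response.

```python
import re

# Field naming scheme from the text above: pdf_page_1, pdf_page_2, ...
PAGE_FIELD = re.compile(r"^pdf_page_(\d+)$")


def build_document(pages):
    """Map a list of page texts to one field per page."""
    return {f"pdf_page_{n}": text for n, text in enumerate(pages, start=1)}


def highlight_request(query_text):
    """Search body that queries and highlights every pdf_page_* field."""
    return {
        "query": {"multi_match": {"query": query_text, "fields": ["pdf_page_*"]}},
        "highlight": {"fields": {"pdf_page_*": {}}},
    }


def pages_from_hit(hit):
    """Return {page_number: [fragments]} from the highlight section of one hit."""
    result = {}
    for field, fragments in hit.get("highlight", {}).items():
        match = PAGE_FIELD.match(field)
        if match:
            result[int(match.group(1))] = fragments
    return result


doc = build_document(["here comes page 1", "here comes page 2"])

# Hand-written example of what a hit with highlights might look like.
example_hit = {
    "_id": "1",
    "highlight": {"pdf_page_2": ["here comes <em>page</em> 2"]},
}
matched_pages = pages_from_hit(example_hit)
```

One thing to keep in mind with this layout is that long documents produce many distinct fields in the mapping, which may become a problem for large PDFs.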

Another solution, which is perhaps more flexible, is to a) index your documents with the PDF file contents all in one array field, like pdf_contents: ["here comes page 1", "here comes page 2"], and b) create a separate index for PDF file contents, indexing each page as a separate document that includes a field for the page number.
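
A sketch of how the two payloads might be built, assuming the page texts are already extracted. The field names doc_id, page_number and content, the index name pdf_pages, and the "<doc_id>_<page>" id scheme are assumptions for illustration. Older ES versions also expect a _type in the bulk action line.

```python
import json


def build_index_payloads(doc_id, title, pages):
    """Return (main_doc, page_docs) for the two indices described above."""
    # a) the main document keeps all pages in a single array field
    main_doc = {"title": title, "pdf_contents": list(pages)}
    # b) one small document per page, pointing back to the main document
    page_docs = [
        {"doc_id": doc_id, "page_number": n, "content": text}
        for n, text in enumerate(pages, start=1)
    ]
    return main_doc, page_docs


def to_bulk_body(index_name, page_docs):
    """Serialize page documents as a newline-delimited _bulk request body."""
    lines = []
    for page in page_docs:
        action = {
            "index": {
                "_index": index_name,
                "_id": f"{page['doc_id']}_{page['page_number']}",
            }
        }
        lines.append(json.dumps(action))
        lines.append(json.dumps(page))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


main_doc, page_docs = build_index_payloads(
    "42", "Report", ["here comes page 1", "here comes page 2"]
)
bulk_body = to_bulk_body("pdf_pages", page_docs)
```

The resulting string could be sent to the _bulk endpoint with any HTTP client, and main_doc would be indexed into the main index as usual.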

Then, run one query for your "canonical" result list and, once you have the hits, run a second query against the PDF file contents index, restricted to the documents in that result list.
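
The second query could look roughly like this, assuming each per-page document carries the fields doc_id, page_number and content (hypothetical names) and that your ES version supports a bool query with a filter clause. On older versions a filtered query would play the same role. The helper names are illustrative, and the hits are hand-written stand-ins for real responses.

```python
def page_query(query_text, result_hits):
    """Build the follow-up search body, limited to documents in the result list."""
    doc_ids = [hit["_id"] for hit in result_hits]
    return {
        "query": {
            "bool": {
                "must": {"match": {"content": query_text}},
                # Drop every page whose parent document is not in the result list.
                "filter": {"terms": {"doc_id": doc_ids}},
            }
        },
        "highlight": {"fields": {"content": {}}},
    }


def group_pages(page_hits):
    """Group page hits as {doc_id: [(page_number, fragments), ...]}, sorted by page."""
    grouped = {}
    for hit in page_hits:
        source = hit["_source"]
        fragments = hit.get("highlight", {}).get("content", [])
        grouped.setdefault(source["doc_id"], []).append(
            (source["page_number"], fragments)
        )
    for pages in grouped.values():
        pages.sort(key=lambda entry: entry[0])
    return grouped


# Hits from the first ("canonical") query; only the ids matter here.
canonical_hits = [{"_id": "42"}, {"_id": "7"}]
body = page_query("page", canonical_hits)

# Hand-written example of hits returned by the second query.
page_hits = [
    {"_source": {"doc_id": "42", "page_number": 3}, "highlight": {"content": ["c"]}},
    {"_source": {"doc_id": "42", "page_number": 1}, "highlight": {"content": ["a"]}},
    {"_source": {"doc_id": "7", "page_number": 2}},
]
pages_by_doc = group_pages(page_hits)
```

Since one document can match on many pages, the size of the second query probably needs to be larger than the default of 10 hits.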
