Index PDFs with Elasticsearch page by page vs. using the ingest plugin

Asked Jul 14 '18 at 20:18

Active Jul 14 '18 at 20:18

Viewed 968 times

I am working on a project to index a collection of PDF documents. For this task I've chosen Elasticsearch, as it is based on Apache Lucene. I have looked at several docs:

https://www.elastic.co/guide/en/elasticsearch/plugins/6.3/using-ingest-attachment.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/ingest-processors.html#ingest-processors

and Stack Overflow questions: How to index a pdf file in Elasticsearch 5.0.0 with ingest-attachment plugin?
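
For context, the ingest approach described in the first link boils down to roughly the following. This is only a minimal sketch in Python that builds the two request bodies without sending them. `build_attachment_doc`, `my_index`, and `my_id` are placeholder names, and the field name `data` is simply the one used in the plugin docs.

```python
import base64
import json

# Pipeline definition in the shape shown in the ingest-attachment docs:
# the "attachment" processor reads base64-encoded content from the "data" field.
PIPELINE_BODY = {
    "description": "Extract attachment information",
    "processors": [{"attachment": {"field": "data"}}],
}


def build_attachment_doc(pdf_bytes: bytes) -> dict:
    """Return the document body to index through the pipeline.

    The whole PDF is base64-encoded into a single field, so one PDF
    becomes one Elasticsearch document.
    """
    return {"data": base64.b64encode(pdf_bytes).decode("ascii")}


if __name__ == "__main__":
    # Placeholder bytes; a real run would read the PDF file in binary mode.
    doc = build_attachment_doc(b"%PDF-1.4 placeholder bytes")
    # These bodies would be sent as (placeholder index and id):
    #   PUT _ingest/pipeline/attachment
    #   PUT my_index/_doc/my_id?pipeline=attachment
    print(json.dumps(PIPELINE_BODY))
    print(json.dumps(doc))
```

Note that with this approach the whole file travels to the cluster as base64 and is parsed on the ingest node.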

In terms of performance, storage space, and effectiveness, which would be the better approach: using the ingest plugin as described, or parsing the PDF and storing every one, two, or three pages (this could be a configurable parameter) as a separate document?
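
To make the second option concrete, here is a rough sketch of the page-grouping step. It assumes the page texts have already been extracted by some PDF parser (not shown). `group_pages` and the field names (`source_file`, `page_start`, `page_end`, `content`) are illustrative names, not anything Elasticsearch requires.

```python
from typing import Dict, Iterator, List


def group_pages(doc_id: str, pages: List[str], pages_per_doc: int = 1) -> Iterator[Dict]:
    """Yield one index-ready document per group of consecutive pages.

    `pages` holds the already-extracted text of each page, in order.
    `pages_per_doc` is the tunable parameter from the question (1, 2, 3, ...).
    """
    if pages_per_doc < 1:
        raise ValueError("pages_per_doc must be at least 1")
    for start in range(0, len(pages), pages_per_doc):
        chunk = pages[start:start + pages_per_doc]
        yield {
            # Hypothetical id scheme: file name plus 1-based first page number.
            "_id": f"{doc_id}-p{start + 1}",
            "source_file": doc_id,
            "page_start": start + 1,
            "page_end": start + len(chunk),
            "content": "\n".join(chunk),
        }


if __name__ == "__main__":
    # Five fake pages grouped two at a time; the last group has one page.
    for d in group_pages("report.pdf", ["a", "b", "c", "d", "e"], pages_per_doc=2):
        print(d["_id"], d["page_start"], d["page_end"])
```

Each yielded dict could then be sent to the cluster, for example through the bulk API, and a search hit would point at a page range rather than at the whole file.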

asked Jul 14 '18 at 20:18

Adelin

We ran into the same requirements; please share your inputs – Nag Feb 25 '21 at 05:43
I also have a similar problem to solve, do let us know if you found any efficient solution. – user_12 Jun 04 '21 at 19:11

0 Answers