I'm new to Elasticsearch, and I read here https://www.elastic.co/guide/en/elasticsearch/plugins/master/mapper-attachments.html that the mapper-attachments plugin is deprecated in Elasticsearch 5.0.0.

I'm now trying to index a PDF file and upload the attachment with the new ingest-attachment plugin.

What I've tried so far is:

curl -H 'Content-Type: application/pdf' -XPOST localhost:9200/test/1 -d @/cygdrive/c/test/test.pdf

but I get the following error:

{"error":{"root_cause":[{"type":"mapper_parsing_exception","reason":"failed to parse"}],"type":"mapper_parsing_exception","reason":"failed to parse","caused_by":{"type":"not_x_content_exception","reason":"Compressor detection can only be called on some xcontent bytes or compressed xcontent bytes"}},"status":400}

I would expect the PDF file to be indexed and uploaded. What am I doing wrong?

I also tested Elasticsearch 2.3.3, but the mapper-attachments plugin is not valid for that version, and I don't want to use an older version of Elasticsearch.

1 Answer

You need to make sure you have created your ingest pipeline with:

PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data",
        "indexed_chars" : -1
      }
    }
  ]
}
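
If you are working from curl (as in the question) rather than the Kibana console, the same request looks roughly like this; a minimal sketch, assuming Elasticsearch is listening on localhost:9200:

curl -H 'Content-Type: application/json' -XPUT 'localhost:9200/_ingest/pipeline/attachment' -d '{
  "description": "Extract attachment information",
  "processors": [
    { "attachment": { "field": "data", "indexed_chars": -1 } }
  ]
}'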

Then you can issue a PUT (not a POST) to your index, using the pipeline you've created.

PUT my_index/my_type/my_id?pipeline=attachment
{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}

In your example, it should be something like this (note that the document URL needs index/type/id for a PUT, and the body must be JSON with the PDF base64-encoded in a data field, not the raw PDF bytes):

curl -H 'Content-Type: application/json' -XPUT 'localhost:9200/test/my_type/1?pipeline=attachment' -d '{"data": "'"$(base64 -w 0 /cygdrive/c/test/test.pdf)"'"}'

Remember that the PDF content must be base64-encoded; the not_x_content_exception in your error comes from sending raw PDF bytes where Elasticsearch expected JSON. (For very large files, write the JSON body to a temporary file and pass it with -d @file instead.)
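
If you want to sanity-check the pipeline before indexing real files, the ingest simulate API is useful. A quick sketch with a throwaway payload (the base64 string below is just the text "Base64 test", not a real PDF; Tika handles plain text too):

curl -H 'Content-Type: application/json' -XPOST 'localhost:9200/_ingest/pipeline/attachment/_simulate' -d '{
  "docs": [
    { "_source": { "data": "QmFzZTY0IHRlc3Q=" } }
  ]
}'

The response should show the attachment object with the extracted content filled in.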

Hope it helps.

Edit 1

Please make sure to read these; they helped me a lot:

Elastic Ingest

Ingest Plugin

Ingest Presentation

Edit 2

Also, you must have the ingest-attachment plugin installed:

./bin/elasticsearch-plugin install ingest-attachment
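
The node has to be restarted after installing a plugin before it will load. You can confirm the plugin is present by listing what is installed:

./bin/elasticsearch-plugin list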

Edit 3

Please, before you create your ingest processor (attachment), create your index and map the fields you will use. Make sure you have the data field in your mapping (the same name as the "field" in your attachment processor), so ingest will process it and fill in the content extracted from your PDF.

I added the indexed_chars option to the ingest processor with a value of -1 so that you can index large PDF files (by default, extraction stops after 100,000 characters).

Edit 4

The mapping should be something like this (note that the attachment processor stores the extracted text in attachment.content, so that is the field to map):

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "attachment.content": {
          "type": "text",
          "analyzer": "brazilian"
        }
      }
    }
  }
}

In this case, I use the brazilian analyzer, but you can remove it or use your own.
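
Once a document has been indexed through the pipeline, you can search the extracted text. A minimal sketch, assuming the index, type, and field names above (the sample base64 payload earlier decodes to RTF containing "Lorem ipsum dolor sit amet", so "lorem" should match it):

curl -H 'Content-Type: application/json' -XPOST 'localhost:9200/my_index/_search' -d '{
  "query": {
    "match": {
      "attachment.content": "lorem"
    }
  }
}'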

  • Why do you need a mapping for the data field? Doesn't the pipeline pick up the data field and process it without it having to be explicitly mapped? What would this mapping look like? – bjlevine Nov 11 '16 at 19:04
  • @bjlevine you do not need to map the field, actually... the processor will create the field (inside your processor's target) by itself. But sometimes you need to have some filter, like in the updated answer. Hope it helps – Evis Nov 11 '16 at 19:10
  • I've fought a lot with the Ingest Attachment plugin. It can't be used in production. I use Ambar (http://ambar.rdseventeen.com) as a solid solution for storing and searching through documents – Ilia P Jan 31 '17 at 11:43
  • @SochiX sure we can use it in production, as it is in production in several cases. I myself have a project running in production mode and running pretty well. Not a big deal, but there are 1K files, over 2 GB of data, and search results in less than a second. – Evis Jan 31 '17 at 12:33
  • @Evert thank you for your comment. But in my case I have 3,000,000 files and the total size of the index is 268 GB. Ingest attachment just eats all the RAM when it tries to process a file larger than 40 MB. That's why I switched to Ambar. – Ilia P Feb 01 '17 at 13:05
  • I wrote a blog post about Ingest Attachment plugin problems. Check it out: [Ingest Attachment Plugin for ElasticSearch: Should You Use It?](https://blog.ambar.cloud/ingest-attachment-plugin-for-elasticsearch-should-you-use-it/) – Ilia P Apr 04 '17 at 13:54
  • @SochiX you are the developer of Ambar, nice, I understand your enthusiasm. I will give it a try, but I confess... I am really happy with ES + Ingest. Cheers! – Evis Apr 04 '17 at 14:38
  • Hey, I get how to create the pipeline and am able to push a document to my index. However, what if my PDF is stored in a field of my Django object? How do I index the other fields along with this PDF? – rishran Apr 22 '17 at 02:19
  • @RishabhRanawat as in my **Edit 4**, you just enter the properties (fields) you need, as per the official documentation https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html, and when indexing, just fill your post with the data as needed. Hope it was of help. – Evis Apr 25 '17 at 12:12
  • When using `XPUT` to post a PDF I get `{"error":"Content-Type header [application/pdf] is not supported","status":406}`. Any suggestions? – Augustas May 02 '17 at 13:36
  • @Augustas are you using curl? Which version of ES are you using? I suggest posting a new question with your code so we can help you better. – Evis May 02 '17 at 13:54
  • How do you query this document? – Adelin Jul 14 '18 at 17:47
  • Hi @IlyaP, did you find a solution to your problem with huge data? – sushilprj Jul 27 '20 at 19:31
  • @sushilprj yep, try Ambar! – Ilia P Jul 28 '20 at 09:47
  • @IlyaP, Ambar looks great! How does it compare with https://github.com/opensemanticsearch ? – Stacky Jul 28 '20 at 17:09
  • Hi, @Stacky it's more lightweight I guess and easy to setup/host – Ilia P Jul 29 '20 at 10:01
  • @IlyaP I created a question on Ambar github to avoid hijacking this thread. Could you comment? Thanks! https://github.com/RD17/ambar/issues/287 – Stacky Aug 03 '20 at 16:20
  • Good tips you have provided! – Rui Marques Dec 23 '21 at 14:49