How to index a pdf file using Elasticsearch ingest-attachment plugin?

Question

I have to implement full-text search over a PDF document using the Elasticsearch ingest-attachment plugin. I get an empty hits array when I search for the word someword in the PDF document.

//Code for creating pipeline

PUT _ingest/pipeline/attachment
{
    "description" : "Extract attachment information",
    "processors" : [
      {
        "attachment" : {
        "field" : "data",
        "indexed_chars" : -1
        }
      }
    ]
}

//Code for indexing the document through the pipeline

PUT my_index/my_type/my_id?pipeline=attachment
{
   "filename" : "C:\\Users\\myname\\Desktop\\bh1.pdf",
   "title" : "Quick",
   "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="

}

//Code for searching the word in pdf 

GET /my_index/my_type/_search
{
    "query": {
        "match": {
            "data": {
                "query": "someword"
            }
        }
    }
}

If you open the PDF in a PDF viewer, are you able to search for "someword" in it and find a match? — Alcanzar, Feb 08 '17 at 14:39
This looks like a duplicate of http://stackoverflow.com/questions/37861279/how-to-index-a-pdf-file-in-elasticsearch-5-0-0-with-ingest-attachment-plugin -- note that your PUT statement is putting a specific "data" for the file. You need to use curl or something like that to pass the specific file data. The "data" you are putting in is `Lorem ipsum dolor sit amet` -- if you search for Lorem, you'd find a result — Alcanzar, Feb 08 '17 at 14:55
@Alcanzar I verified by searching for Lorem, running the GET from the Kibana dashboard, but there are still no hits. — Ashley, Feb 08 '17 at 15:29
@Alcanzar Can you please explain the theory behind how Elasticsearch indexes unstructured data like PDF files? — Ashley, Feb 08 '17 at 15:58
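
To check what the `data` value in the indexing request actually contains, as the comment above suggests, it can be decoded locally. A minimal Python sketch using only the standard library (the variable names are illustrative, not from the original post):

```python
import base64

# The Base64 string from the "data" field of the indexing request above.
encoded = "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="

# Decode to raw bytes; this is the content the attachment processor receives.
decoded = base64.b64decode(encoded)

# Show the raw bytes, escape sequences included.
print(decoded)
```

The decoded bytes are a small RTF snippet containing "Lorem ipsum dolor sit amet", not the contents of bh1.pdf, so a search for "someword" has nothing to match.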

Val · Accepted Answer · 2017-02-13T08:24:25.310

When you index your document with the second command by passing the Base64 encoded content, the document then looks like this:

        {
           "filename": "C:\\Users\\myname\\Desktop\\bh1.pdf",
           "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
           "attachment": {
              "content_type": "application/rtf",
              "language": "ro",
              "content": "Lorem ipsum dolor sit amet",
              "content_length": 28
           },
           "title": "Quick"
        }

So your query needs to target the attachment.content field, not the data field (which only carries the raw Base64 content during indexing).

Modify your query to this and it will work:

POST /my_index/my_type/_search
{
   "query": {
      "match": {
         "attachment.content": {
            "query": "lorem"
         }
      }
   }
}

PS: Use POST instead of GET when sending a payload, since some HTTP clients drop the body of a GET request.
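
For reference, here is a rough Python sketch of sending the same search as a POST with a JSON body, using only the standard library. The host `http://localhost:9200` and the helper name `build_search_request` are assumptions for illustration; the block only builds the request, and the actual network call is left commented out because it needs a running cluster.

```python
import json
import urllib.request

# Assumption: a local Elasticsearch node; adjust to your cluster.
ES_URL = "http://localhost:9200"


def build_search_request(index, doc_type, text):
    """Build (but do not send) a POST search on attachment.content."""
    body = {"query": {"match": {"attachment.content": {"query": text}}}}
    return urllib.request.Request(
        url=f"{ES_URL}/{index}/{doc_type}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_search_request("my_index", "my_type", "lorem")
print(request.get_method(), request.full_url)

# Sending it requires a running cluster:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["hits"]["hits"])
```

Setting `method="POST"` explicitly keeps the intent visible even though `urllib` would already default to POST when `data` is supplied.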

edited Feb 13 '17 at 08:24

answered Feb 11 '17 at 14:38

Val

Any idea on how can we convert a pdf file to base64 encoded file using elastic search ? – Ashley Feb 13 '17 at 08:47
I think this should be a new question as it is unrelated to this one. – Val Feb 13 '17 at 08:49
Why did you use POST instead of GET? The latter works fine for me – Ashley Feb 13 '17 at 09:00
It depends on the HTTP client you're using, but you should **NEVER** send a payload via GET (= not HTTP compliant). See a more detailed example here: http://stackoverflow.com/questions/34795053/es-keeps-returning-every-document/34796014#34796014 – Val Feb 13 '17 at 09:13
@Ashley afaik that is done before data is sent to elastic search, so you do the conversion with whatever method exists in the programming language you are using. – Rui Marques Dec 23 '21 at 14:54
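
As the last comment says, the Base64 conversion happens on the client before the document is sent. A minimal Python sketch of that step, assuming the same field names as the indexing request in the question; the helper names `build_ingest_document` and `build_ingest_document_from_path` are hypothetical, and the sample bytes stand in for a real PDF:

```python
import base64
import json
from pathlib import Path


def build_ingest_document(filename, raw_bytes, title=None):
    """Return a dict shaped like the indexing request body in the question."""
    doc = {
        "filename": filename,
        # b64encode returns bytes; decode to str so it is JSON-serializable.
        "data": base64.b64encode(raw_bytes).decode("ascii"),
    }
    if title is not None:
        doc["title"] = title
    return doc


def build_ingest_document_from_path(path, title=None):
    """Read the file at `path` and encode its contents."""
    file_path = Path(path)
    return build_ingest_document(file_path.name, file_path.read_bytes(), title)


# Example with in-memory bytes standing in for a real PDF file.
sample = build_ingest_document("bh1.pdf", b"%PDF-1.4 sample bytes", title="Quick")
print(json.dumps(sample, indent=2))
```

The printed JSON can then be used as the body of `PUT my_index/my_type/my_id?pipeline=attachment`, so the attachment processor extracts text from the real file rather than from the sample RTF string.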
