I have to implement full-text search over a PDF document using the Elasticsearch ingest attachment plugin. I get an empty hits array when I search for the word someword in the PDF document.

//Code for creating pipeline

PUT _ingest/pipeline/attachment
{
    "description" : "Extract attachment information",
    "processors" : [
      {
        "attachment" : {
        "field" : "data",
        "indexed_chars" : -1
        }
      }
    ]
}

//Code for indexing the document

PUT my_index/my_type/my_id?pipeline=attachment
{
   "filename" : "C:\\Users\\myname\\Desktop\\bh1.pdf",
   "title" : "Quick",
   "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="

}

//Code for searching the word in pdf 

GET /my_index/my_type/_search
{
    "query": {
        "match": {
            "data": {
                "query": "someword"
            }
        }
    }
}
Ashley
  • If you open the PDF in a PDF viewer, are you able to search for "someword" in it and find a match? – Alcanzar Feb 08 '17 at 14:39
  • @Alcanzar Yeah it searches for the word. – Ashley Feb 08 '17 at 14:51
  • This looks like a duplicate of http://stackoverflow.com/questions/37861279/how-to-index-a-pdf-file-in-elasticsearch-5-0-0-with-ingest-attachment-plugin -- note that your PUT statement is putting a specific "data" for the file. You need to use curl or something like that to pass the specific file data. The "data" you are putting in is `Lorem ipsum dolor sit amet` -- if you search for Lorem, you'd find a result – Alcanzar Feb 08 '17 at 14:55
  • @Alcanzar I verified by searching for Lorem, running the GET on the Kibana dashboard. But there are still no hits. – Ashley Feb 08 '17 at 15:29
  • @Alcanzar Can you please explain the theory behind how Elasticsearch indexes unstructured data like PDF files? – Ashley Feb 08 '17 at 15:58

1 Answer

When you index your document with the second command by passing the Base64 encoded content, the document then looks like this:

        {
           "filename": "C:\\Users\\myname\\Desktop\\bh1.pdf",
           "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
           "attachment": {
              "content_type": "application/rtf",
              "language": "ro",
              "content": "Lorem ipsum dolor sit amet",
              "content_length": 28
           },
           "title": "Quick"
        }

So your query needs to target the attachment.content field, not the data field (which only serves to carry the raw Base64 content during indexing).
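To check what the data field actually held, the Base64 string from the question can be decoded locally. The following is a minimal Python sketch using only the standard library; the variable names are illustrative and not part of the original post.

```python
import base64

# The exact "data" value from the indexing request in the question.
data = "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="

# Decode the Base64 payload back into the raw bytes Elasticsearch received.
raw = base64.b64decode(data)
decoded = raw.decode("ascii")

# The payload is a small RTF snippet, not the PDF named in "filename".
print(repr(decoded))
print("contains 'Lorem':", "Lorem" in decoded)
print("contains 'someword':", "someword" in decoded)
```

The decoded text is a short RTF document containing "Lorem ipsum dolor sit amet", which matches the content_type and content values shown in the indexed document above and explains why a search for someword cannot match.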

Modify your query to this and it will work:

POST /my_index/my_type/_search
{
   "query": {
      "match": {
         "attachment.content": {
            "query": "lorem"
         }
      }
   }
}

PS: Use POST instead of GET when sending a payload
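For readers sending the corrected query from a script instead of the Kibana console, here is a hedged Python sketch that builds the same search as a POST request with the standard library. The host and port (http://localhost:9200) are an assumption and not given in the original post; the request is only constructed here, not sent.

```python
import json
import urllib.request

# Assumption: a local Elasticsearch node on the default port.
ES_URL = "http://localhost:9200"

# Same query as above, targeting the text extracted by the attachment processor.
query = {"query": {"match": {"attachment.content": {"query": "lorem"}}}}

# Build an explicit POST request carrying the JSON body.
request = urllib.request.Request(
    url=f"{ES_URL}/my_index/my_type/_search",
    data=json.dumps(query).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

print(request.get_method(), request.full_url)
# To actually run the search against a live node:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["hits"])
```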

Val
  • Any idea how we can convert a PDF file to a Base64-encoded file using Elasticsearch? – Ashley Feb 13 '17 at 08:47
  • I think this should be a new question as it is unrelated to this one. – Val Feb 13 '17 at 08:49
  • Why did you use POST instead of GET? The latter works fine for me – Ashley Feb 13 '17 at 09:00
  • It depends on the HTTP client you're using, but you should **NEVER** send a payload via GET (= not HTTP compliant). See a more detailed example here: http://stackoverflow.com/questions/34795053/es-keeps-returning-every-document/34796014#34796014 – Val Feb 13 '17 at 09:13
  • @Ashley afaik that is done before the data is sent to Elasticsearch, so you do the conversion with whatever method exists in the programming language you are using. – Rui Marques Dec 23 '21 at 14:54
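
As the last comment suggests, the Base64 conversion happens on the client before the document is sent. Below is a minimal Python sketch of that step, assuming the same field names as the indexing request in the question. The helper name build_attachment_doc and the placeholder file contents are hypothetical and used for illustration only.

```python
import base64
import json
import os
import tempfile
from pathlib import Path


def build_attachment_doc(pdf_path: str, title: str) -> dict:
    """Read a file and return a document body for the attachment pipeline."""
    raw = Path(pdf_path).read_bytes()
    return {
        "filename": pdf_path,
        "title": title,
        # The attachment processor expects the file content as Base64 text.
        "data": base64.b64encode(raw).decode("ascii"),
    }


# Demo with a throwaway file; in practice pass the path of the real PDF.
with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
    tmp.write(b"%PDF-1.4 placeholder bytes")  # hypothetical stand-in content
    demo_path = tmp.name

doc = build_attachment_doc(demo_path, "Quick")
os.remove(demo_path)

# This JSON is what would be sent as the body of
# PUT my_index/my_type/my_id?pipeline=attachment
print(json.dumps(doc, indent=2))
```

The resulting JSON can be sent with any HTTP client; the extracted text then appears under attachment.content, which is the field to search.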