35

I am new to Elasticsearch. I have gone through a very basic tutorial on creating indexes and I understand the concept of indexing. I want Elasticsearch to search inside a .PDF file. Based on my understanding of creating indexes, it seems I need to read the .PDF file and extract all the keywords for indexing. But I do not understand what steps I need to follow. How do I read a .PDF file to extract keywords?

– KurioZ7
    You probably need to check out the [elasticsearch-mapper-attachments plugin](https://github.com/elastic/elasticsearch-mapper-attachments), it should do what you expect. – Val Jan 18 '16 at 14:39
  • If you want an out-of-the-box solution you could try [Ambar](https://ambar.cloud) – Ilia P May 03 '17 at 08:20

6 Answers

54

It seems that the elasticsearch-mapper-attachments plugin has been deprecated in 5.0.0 (released Oct. 26th, 2016). The documentation recommends using the Ingest Attachment Processor Plugin as a replacement.

To install:

sudo bin/elasticsearch-plugin install ingest-attachment

See How to index a pdf file in Elasticsearch 5.0.0 with ingest-attachment plugin? for information on how to use the Ingest Attachment plugin.
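For example, here is a minimal sketch of driving the plugin through the REST API with Python's requests library. It assumes a local node on port 9200; the index name pdfs, the type name doc (mapping types are still required in 5.x), and the field names are illustrative:

import base64
import requests

ES = "http://localhost:9200"

# Create an ingest pipeline that runs the attachment processor on the
# base64-encoded "data" field; the extracted text lands in "attachment.content".
requests.put(ES + "/_ingest/pipeline/attachment", json={
    "description": "Extract text from PDF attachments",
    "processors": [{"attachment": {"field": "data"}}]
})

# Index a PDF through the pipeline; "attachment.content" is then
# full-text searchable like any other field.
with open("sample.pdf", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("ascii")

requests.put(ES + "/pdfs/doc/1?pipeline=attachment",
             json={"filename": "sample.pdf", "data": encoded})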

– Ben.12
    This is the correct answer as of today (11/18/2016). elasticsearch-mapper-attachments is outdated and does not work with elasticsearch >= 5.0.0 but `ingest-attachment` works like a charm. – Kevin G. Nov 18 '16 at 16:07
12

You need to check out the [elasticsearch-mapper-attachments plugin](https://github.com/elastic/elasticsearch-mapper-attachments), as it is very likely to help you achieve what you need.

UPDATE:

The above plugin has been superseded by the ingest-attachment processor plugin in ES 5.0.

– Val
6

Install the Elasticsearch mapper-attachments plugin and use code similar to:

import java.io.IOException;

import org.elasticsearch.ElasticsearchException;
import org.elasticsearch.action.index.IndexResponse;
import org.elasticsearch.common.Base64;

import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;

public String indexDocument(String filePath, DataDTO dto) {
    IndexResponse response = null;
    try {
        // prepareIndexRequest(...) is assumed to wrap client.prepareIndex(index, type);
        // the plugin expects the file content to be Base64-encoded in the "file" field.
        response = this.prepareIndexRequest("collectionName").setId(dto.getId())
                .setSource(jsonBuilder().startObject()
                        .field("file", Base64.encodeFromFile(filePath))
                        .endObject())
                .setRefresh(true).execute().actionGet();
    } catch (ElasticsearchException | IOException e) {
        // Bail out instead of dereferencing the null response below.
        return null;
    }
    return response.getId();
}
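Note that the plugin only parses fields mapped as type attachment, so the index needs a mapping along these lines before the snippet above will work (a minimal sketch with Python's requests; the index name collectionName and type name doc are illustrative):

import requests

# Map the "file" field as an attachment so the mapper-attachments plugin
# extracts and indexes the text content of the Base64-encoded file.
requests.put("http://localhost:9200/collectionName", json={
    "mappings": {"doc": {"properties": {"file": {"type": "attachment"}}}}
})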
– sanesearch
5

As mentioned, the elasticsearch-mapper-attachments plugin has been deprecated; the Ingest Attachment plugin can be used instead:

https://www.elastic.co/guide/en/elasticsearch/plugins/current/ingest-attachment.html

– fakturk
2

For my project I also had to make my local .PDF files searchable. I achieved this as follows:

  1. Extracted the data from the .PDF file using Apache Tika. I used Apache Tika because it lets me extract data from different file extensions with the same pipeline (see the sketch after the example document below).
  2. Used the output of Apache Tika for indexing.

Usually my indexed document looked like:

{ "filename": "FILENAME", "filebody": "Data extracted from Apache Tika" }
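Here is a minimal sketch of that pipeline using the tika Python package (pip install tika; it needs a Java runtime and starts a local Tika server on first use). The index name documents and the URL are assumptions; the field names follow the example document above:

import requests
from tika import parser

def index_file(path, filename):
    # Tika parses PDFs, Word documents, and many other formats through
    # this single call, which is what makes one pipeline cover many extensions.
    parsed = parser.from_file(path)
    doc = {"filename": filename, "filebody": parsed.get("content") or ""}
    # "doc" is a mapping type name; on Elasticsearch 7+ use "_doc" instead.
    requests.post("http://localhost:9200/documents/doc", json=doc)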


There are multiple solutions out there; as mentioned here, using the Elasticsearch mapper-attachments plugin is also a good option. I opted for this approach because I wanted to work with large files and different extensions.

0

I found the code below at Pdf to elastic search; it extracts the text of a PDF and puts it into Elasticsearch.

import PyPDF2   # PyPDF2 < 3.0 API; version 3.0 renamed PdfFileReader to PdfReader
import json
import os
import requests


class ElasticModel:

    name = ""
    msg = ""

    def toJSON(self):
        return json.dumps(self, default=lambda o: o.__dict__,
                          sort_keys=True, indent=4)


def read_pdf(path):
    # Extract the text of the first page only; loop over
    # range(pdfReader.numPages) if you need the whole document.
    with open(path, 'rb') as pdfFileObj:
        pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
        print(pdfReader.numPages)
        pageObj = pdfReader.getPage(0)
        line = pageObj.extractText()
    line = line.replace("\n", "")
    print(line)
    return line


def prepare_elastic_model(line, name):
    eModel = ElasticModel()
    eModel.name = name
    eModel.msg = line
    return eModel


def send_to_elastic_search(elasticModel):
    #############################################
    # CHANGE INDEX NAME IF NEEDED
    #############################################
    index = "samplepdf"

    url = "http://localhost:9200/" + index + "/_doc?pretty"
    data = elasticModel.toJSON()
    response = requests.post(url, data=data,
                             headers={'Content-Type': 'application/json'})
    print("Url : " + url)
    print("Data : " + str(data))
    print("Response : " + str(response))
    print("Response body : " + response.text)


#################################
# Change pdf dir path
#################################
pdfdir = "C:/Users/abhis/Desktop/TemplatesPDF/SamplePdf"

listFiles = os.listdir(pdfdir)
for file in listFiles:
    path = pdfdir + "/" + file
    print(path)

    line = read_pdf(path)
    eModel = prepare_elastic_model(line, file)
    send_to_elastic_search(eModel)

The above code was run against a sample PDF:

[screenshot of the sample PDF]

From that sample, a couple of fields (name, holding the filename, and msg, holding the extracted text) are indexed into Elasticsearch. Hope this helps.

– Abhishek