How do I index PDF document content in Elasticsearch?

Question

I am trying to index documents (PDFs, for example) into Elasticsearch.
My objective is to search documents by matching strings in their content.
To extract the document content, I am using Apache Tika.
I am not sure how I should index the document content along with the document metadata.

Below are the options I can think of:

Should I just add one field, "content", with the string data type and simply store the document content there as a string? (But I am not sure it will work for large documents.)

Or should I make that field binary and encode the document content there? (But then it will not be searchable.)

Please advise.

score 1 · Answer 1 · answered Oct 19 '16 at 12:34

It all depends on whether you can structure the content or not. For example, if you are going to store invoices (incoming PDF files), you could set up some patterns to find company names, addresses, items, prices, VAT, etc. and store this data in clean JSON form. Searches will be fast and storage efficient.
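A minimal sketch of the pattern-based approach, assuming the text has already been extracted from the PDF (for example by Tika). The field names, the regular expressions, and the sample invoice text are all hypothetical; real invoices would need patterns tuned to their layout.

```python
import json
import re

# Hypothetical patterns: each one captures a single value from a labelled line.
PATTERNS = {
    "company_name": re.compile(r"^Company:\s*(.+)$", re.MULTILINE),
    "vat_rate": re.compile(r"^VAT:\s*(\d+(?:\.\d+)?)\s*%$", re.MULTILINE),
    "total": re.compile(r"^Total:\s*(\d+(?:\.\d+)?)$", re.MULTILINE),
}


def extract_invoice_fields(text):
    """Return a dict holding only the fields whose pattern matched the text."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            fields[name] = match.group(1).strip()
    return fields


# Made-up sample standing in for text extracted from an invoice PDF.
sample_text = "Company: ABC Ltd\nVAT: 20 %\nTotal: 1200.50\n"
invoice_doc = extract_invoice_fields(sample_text)

# The resulting dict is the "clean JSON" that would be sent to the index.
print(json.dumps(invoice_doc, indent=2))
```

Fields that do not match are simply left out, so a partially recognised invoice still yields a usable document.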

On the other hand, you could be storing some random content (or you don't know what the content will be). In that situation you should just read all the data you can into a content string and store it "as is". You will still get full-text search (by keywords and phrases) but no structured search or ordering (for example, companyName=ABC).
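A rough sketch of the "store it as is" approach, built as plain JSON bodies so it does not depend on any particular client library. The field names are illustrative assumptions. The `text` and `keyword` types apply to Elasticsearch 5.x and later (2.x used `string`), and how the `properties` fragment is nested inside the index-creation request depends on the version.

```python
import json

# Mapping fragment: "content" is analyzed for full-text search, while the
# metadata fields are exact-match keywords. All field names are assumptions.
MAPPING = {
    "properties": {
        "content": {"type": "text"},
        "file_name": {"type": "keyword"},
        "content_type": {"type": "keyword"},
    }
}


def build_document(file_name, content_type, extracted_text):
    """Combine metadata and the extracted text into one indexable document."""
    return {
        "file_name": file_name,
        "content_type": content_type,
        "content": extracted_text,
    }


def build_phrase_query(phrase):
    """Search body that matches documents containing the exact phrase."""
    return {"query": {"match_phrase": {"content": phrase}}}


# The text would come from Tika in practice; this is a made-up sample.
doc = build_document(
    "report.pdf", "application/pdf", "Quarterly revenue grew by ten percent."
)
query = build_phrase_query("revenue grew")

print(json.dumps(doc, indent=2))
print(json.dumps(query, indent=2))
```

The document and query bodies can then be sent with whichever client or HTTP tool is in use.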

In both cases I would store the original binary file somewhere on the filesystem (like my-uid-string.pdf) and serve it as a simple file when needed. I prefer not to store binary data in databases even though most of them have the ability to do it.
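One way to sketch the filesystem side, assuming a generated UUID serves as the "uid string" and that the returned file name is saved alongside the indexed document so the original can be found again. The function names and the temporary directory are hypothetical stand-ins.

```python
import tempfile
import uuid
from pathlib import Path


def store_original(pdf_bytes, storage_dir):
    """Write the original file under a generated unique name and return the name."""
    storage_dir = Path(storage_dir)
    storage_dir.mkdir(parents=True, exist_ok=True)
    stored_name = f"{uuid.uuid4().hex}.pdf"
    (storage_dir / stored_name).write_bytes(pdf_bytes)
    return stored_name


def load_original(stored_name, storage_dir):
    """Read the original file back, for example to serve it as a download."""
    return (Path(storage_dir) / stored_name).read_bytes()


# A temporary directory stands in for the real storage location.
with tempfile.TemporaryDirectory() as tmp:
    name = store_original(b"%PDF-1.4 sample bytes", tmp)
    roundtrip = load_original(name, tmp)

print(name)
```

Only the short file name needs to go into the index; the binary itself never passes through the search engine.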

My use case is the second one, random content. I don't know what a particular file will contain, but if it has some text, I want to store it in Elasticsearch and make it searchable by that text. Even full-text search requires storing the document content in Elasticsearch, and that is my concern. How should I store the document content? If I store it as a string, it might cause issues for big documents. – AKS, Oct 20 '16 at 20:39
Elasticsearch can store large strings, but Lucene can't index one as a single term. There is, however, a mechanism to automatically interpret a large string as a set of short ones. You can start your research here: http://stackoverflow.com/a/28831582/5848808 - good luck! – Boris Schegolev, Oct 21 '16 at 09:01
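To illustrate the point about large strings: Lucene limits a single indexed term to 32,766 bytes, but an analyzed field is split into many short terms, so the size of the whole string matters much less. The tokenizer below is only a crude stand-in for a real analyzer, and the sample text is made up.

```python
import re

# Lucene's per-term size limit in bytes (IndexWriter.MAX_TERM_LENGTH).
MAX_TERM_BYTES = 32766


def simple_tokens(text):
    """Crude analyzer stand-in: lowercase, then split on non-word characters."""
    return [token for token in re.split(r"\W+", text.lower()) if token]


def longest_term_bytes(terms):
    """Size in UTF-8 bytes of the largest term, or 0 if there are no terms."""
    return max((len(term.encode("utf-8")) for term in terms), default=0)


# A string far larger than the per-term limit.
big_text = "Elasticsearch indexes analyzed text as many short terms. " * 2000
whole_size = len(big_text.encode("utf-8"))

terms = simple_tokens(big_text)
largest = longest_term_bytes(terms)

print("whole string exceeds term limit:", whole_size > MAX_TERM_BYTES)
print("largest single term fits:", largest <= MAX_TERM_BYTES)
```

Under these assumptions, the limit becomes a concern mainly when the whole value is indexed as one term, as with a non-analyzed field.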
