I want to perform a content search based on a keyword/phrase which the user types and return the entries that contain those keywords/phrases. The documents on which I want to perform the search are stored in PostgreSQL as binary data.
- What tool / API are you using for indexing? – Sabir Khan Oct 08 '16 at 10:52
- I am using Lucene 3.6.1, which produces index files in a folder. I want to try to use them later on during search. Is this possible? I'd like to store those index files in the database and look them up in that column during search. @SabirKhan – ExTincT Oct 08 '16 at 10:54
- The very purpose of creating indices is to search them later on. Lucene does exactly that, and I recommend using Lucene 6.0.0 or higher. Lucene stores to disk, not to an RDBMS; storing indices in an RDBMS would not be possible. – Sabir Khan Oct 08 '16 at 10:58
- See [this](http://stackoverflow.com/questions/35725908/can-i-store-a-lucene-index-in-a-database-or-other-location-than-a-file-system). You have to follow the steps as pointed out in Sky's answer. – Sabir Khan Oct 08 '16 at 11:05
- If it's no problem for you to set up an additional NoSQL server and also store the text there, you should take a look at Elasticsearch. – Simon Ludwig Oct 08 '16 at 11:07
1 Answer
The first step would be to get readable text out of your binary files. A good library for extracting text from various file types is Apache Tika.
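A minimal sketch of that extraction step using Tika's facade class (the byte array here just stands for whatever you read from your binary column):

```java
import java.io.ByteArrayInputStream;
import org.apache.tika.Tika;

public class TextExtractor {

    // Extracts plain text from a document stored as a byte array,
    // e.g. the binary column read from PostgreSQL. Tika detects the
    // file type (PDF, Word, ...) and picks the right parser itself.
    public static String extractText(byte[] binaryDocument) throws Exception {
        Tika tika = new Tika();
        try (ByteArrayInputStream in = new ByteArrayInputStream(binaryDocument)) {
            return tika.parseToString(in);
        }
    }
}
```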
Once you have readable text out of your documents, you'd need to store this text in PostgreSQL together with some reference to your original binary documents and use PostgreSQL's full text search capabilities for searching: https://www.postgresql.org/docs/9.6/static/textsearch.html
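Roughly, the search could then look like the sketch below. The connection details and the `documents` table (with an `id` column and a `content` text column holding the extracted text) are assumptions for illustration only; `to_tsvector`/`plainto_tsquery` are PostgreSQL's built-in full text search functions.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class FullTextSearchExample {

    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/mydb", "user", "password")) {

            // Match rows whose extracted text contains the user's words.
            String sql = "SELECT id FROM documents "
                       + "WHERE to_tsvector('english', content) @@ plainto_tsquery('english', ?)";

            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setString(1, "keyword or phrase typed by the user");
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println("Matching document id: " + rs.getLong("id"));
                    }
                }
            }
        }
    }
}
```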
An alternative to the database search functionality would be something like Apache Lucene. I've got pretty cool results with Apache Lucene so far.
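If you go the Lucene route, a minimal sketch against the Lucene 6.x API could look like this (the field names, the index path, and the sample text are just placeholders; the important part is storing a database id alongside the indexed text so a hit can be mapped back to the original binary row):

```java
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class LuceneExample {

    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        Directory index = FSDirectory.open(Paths.get("lucene-index"));

        // Indexing: one Lucene document per database row; "dbId" points
        // back to the binary original, "content" holds the Tika-extracted text.
        try (IndexWriter writer = new IndexWriter(index, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new StringField("dbId", "42", Field.Store.YES));
            doc.add(new TextField("content", "text extracted with Tika", Field.Store.NO));
            writer.addDocument(doc);
        }

        // Searching: parse the user's phrase against the "content" field.
        Query query = new QueryParser("content", analyzer).parse("text extracted");
        try (DirectoryReader reader = DirectoryReader.open(index)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                System.out.println("Match, database id: " + searcher.doc(hit.doc).get("dbId"));
            }
        }
    }
}
```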

Sky
- I tried Lucene. It produces multiple index files. Can I store those files in the database and later use them for searching? Is this possible? @Sky – ExTincT Oct 08 '16 at 10:52
- Sorry, I referenced the wrong library. What I recommend for reading text from various file types is Apache Tika - it's great for the job (I've edited my answer). And no, I don't think it makes sense to store the Lucene index in the database - I guess Lucene couldn't access it there anyway. The index should be kept on the file system. Is there a particular reason why you'd like to have it in the database? – Sky Oct 08 '16 at 11:06
- I wanted it that way because I am storing the uploaded documents in the database as binary data. So while searching, it would have been much easier if I could search directly in the binary data (if that is possible). @Sky – ExTincT Oct 08 '16 at 11:14
- I don't think that is possible, simply because binary data can be anything, and every kind of binary file has a different way of getting text out of it. That's why a library like Apache Tika is so useful (it can read a lot of binary file formats). Once you've got text out of the file, you can use Lucene, database full text search, and so on, to search that text. But I think you really need to extract text from the binary files first. – Sky Oct 08 '16 at 11:19
- Hey @Sky, if I save my documents as text and then run a full text search on my table in PostgreSQL, will that be effective? – ExTincT Oct 10 '16 at 04:29
- @ExTincT I guess so, yes. However, I haven't used PostgreSQL full text search so far because I had to develop database-independently and therefore preferred Lucene. – Sky Oct 10 '16 at 06:04
- @ExTincT I just saw that there are ways to store a Lucene index in a database, see [this post](http://stackoverflow.com/questions/17359851/create-lucene-index-in-database-using-jdbcdirectory?rq=1). However, storing the index in the database still requires first extracting text from the binary files and letting Lucene index that text. So I don't think it gives you any advantage. – Sky Oct 10 '16 at 06:11