How to index files such as .txt, .pdf, .doc, etc. using Lucene.net?

Question

I am new to Lucene.net. How do I index files such as .txt, .pdf, .doc, etc. using Lucene.net? And which file types can be indexed with Lucene.net?

What articles have you looked at? – agent-j Jun 01 '12 at 19:04

score 2 · Answer 1 · answered Jun 05 '12 at 16:25

Lucene.net is agnostic about file formats: it indexes text, not files. You have to get the text out of each file yourself.

I would use IFilters to pull out the text in a document and then use Lucene.net to create the search index.

You can search codeproject.com for articles about using IFilters with Lucene.net.
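To make the "extract first, then index" split concrete, here is a minimal sketch of the extraction half. It is written in Python only to show the shape of the pipeline; the `EXTRACTORS` registry and both function names are made up for illustration, and only plain-text files are handled. An IFilter wrapper for .pdf or .doc would be registered the same way.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, Iterator, Tuple

# Hypothetical registry (not part of Lucene.net): file extension -> text extractor.
Extractor = Callable[[Path], str]


def extract_plain_text(path: Path) -> str:
    """Plain-text files need no filter; just decode them."""
    return path.read_text(encoding="utf-8", errors="replace")


EXTRACTORS: Dict[str, Extractor] = {
    ".txt": extract_plain_text,
    # Assumption: an IFilter-based extractor for ".pdf" / ".doc" would be added here.
}


def extract_documents(paths: Iterable[Path]) -> Iterator[Tuple[str, str]]:
    """Yield (path, text) pairs for every file that has a registered extractor."""
    for path in paths:
        extractor = EXTRACTORS.get(path.suffix.lower())
        if extractor is None:
            continue  # unsupported format: skip it rather than index raw bytes
        yield str(path), extractor(path)
```

Each (path, text) pair is what you would then turn into a Lucene.net document, typically one field for the path and one analyzed field for the text, and hand to the index writer.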

score 0 · Answer 2 · edited May 23 '17 at 12:00

Before you index files, you need to extract the text from them properly. Neither Lucene nor Lucene.net does that. For text extraction on Windows you can use IFilters. However, IFilters may not be stable, and they require COM, which has threading issues. In addition, dealing with different IFilters for different versions of document formats is a real hassle.

http://www.codeproject.com/Articles/13391/Using-IFilter-in-C

www.ifilter.org

There are commercial alternatives for text extraction, but they are really expensive.

http://www.isys-search.com/products/document-filters

http://www.oracle.com/us/technologies/embedded/025613.htm

Apache Tika is a good open-source alternative to the commercial ones. It is written in Java.

http://tika.apache.org/

I strongly recommend using Apache Solr/Lucene with a good Solr .NET client instead of Lucene.net. Solr has Tika integration built in, which will do what you want. You don't need to know Java in order to use Solr. It is a standalone web service that can run on a lightweight application server.

If you build a document search solution with Lucene.Net, you will run into many problems that have already been addressed in Solr.

http://www.lucidimagination.com/devzone/technical-articles/content-extraction-tika

http://wiki.apache.org/solr/ExtractingRequestHandler
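Because Solr is just an HTTP service, sending it a file for extraction and indexing is a single POST. The sketch below uses only the Python standard library to show the request shape; the same request can be made from a .NET client. The base URL, the core name `docs`, and both function names are assumptions for illustration, and the exact URL layout depends on your Solr version and configuration.

```python
import urllib.request
from urllib.parse import quote, urlencode


def build_extract_url(solr_base: str, core: str, doc_id: str, commit: bool = True) -> str:
    """Build the URL for Solr's extracting handler (/update/extract).

    literal.id sets the document's unique key; commit=true makes the
    document searchable as soon as the request finishes.
    """
    params = {"literal.id": doc_id, "commit": "true" if commit else "false"}
    return f"{solr_base.rstrip('/')}/{quote(core)}/update/extract?{urlencode(params)}"


def index_file(solr_base: str, core: str, doc_id: str, file_path: str,
               content_type: str = "application/octet-stream") -> int:
    """POST the raw file bytes to Solr, which runs Tika on them and indexes the text."""
    with open(file_path, "rb") as fh:
        body = fh.read()
    request = urllib.request.Request(
        build_extract_url(solr_base, core, doc_id),
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.status  # HTTP status code of Solr's reply
```

For example, `index_file("http://localhost:8983/solr", "docs", "report-1", "report.pdf", "application/pdf")` would send report.pdf to a local Solr instance, assuming one is running with the extracting handler enabled.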

There is a good discussion of Lucene vs. Solr here:

Search Engine - Lucene or Solr
