Also, I want to know how to add metadata while indexing so that I can boost some fields.
4 Answers
There are several frameworks for extracting text suitable for Lucene indexing from rich document formats (PDF, PPT, etc.):
- One of them is Apache Tika, which started as a sub-project of Lucene and is now a top-level Apache project.
- Apache POI is another Apache project; it reads and writes Microsoft Office document formats.
- There are also some commercial alternatives.

You can use Apache Tika. Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.
Supported Document Formats
- HyperText Markup Language
- XML and derived formats
- Microsoft Office document formats
- OpenDocument Format
- Portable Document Format
- Electronic Publication Format
- Rich Text Format
- Compression and packaging formats
- Text formats
- Audio formats
- Image formats
- Video formats
- Java class files and archives
- The mbox format
The code will look like this:

    Reader reader = new Tika().parse(stream);
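To make the one-liner above a bit more concrete, here is a minimal sketch that pulls both the body text and the metadata out of a file with the Tika facade. The class name `TikaExtractSketch` and the `Extracted` holder are made up for illustration, and it assumes `tika-core` plus the Tika parser modules are on the classpath (without the parser modules, the extracted text may come back empty).

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;

// Hypothetical helper class, not part of Tika itself.
public class TikaExtractSketch {

    /** Holds the extracted body text together with the metadata Tika found. */
    static final class Extracted {
        final String text;
        final Metadata metadata;

        Extracted(String text, Metadata metadata) {
            this.text = text;
            this.metadata = metadata;
        }
    }

    /** Detects the document type and extracts text and metadata from the stream. */
    static Extracted extract(InputStream stream) throws IOException, TikaException {
        Metadata metadata = new Metadata();
        // parseToString fills the Metadata object as a side effect
        // and closes the stream when it is done.
        String text = new Tika().parseToString(stream, metadata);
        return new Extracted(text, metadata);
    }

    public static void main(String[] args) throws Exception {
        // args[0] is the path of the file to inspect (PDF, DOCX, ...).
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            Extracted doc = extract(in);
            // Print every metadata key Tika detected, then the body text.
            for (String name : doc.metadata.names()) {
                System.out.println(name + " = " + doc.metadata.get(name));
            }
            System.out.println(doc.text);
        }
    }
}
```

Which metadata keys show up depends on the file type and the Tika version, so it is worth printing them once for your own documents before deciding which ones to index.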

See https://github.com/WolfgangFahl/pdfindexer for a Java solution that uses PDFBox and Apache Lucene to split PDF files into text page by page, index those text pages, and create an HTML index file that links to the pages in the source PDFs by using a corresponding open parameter.

Lucene indexes text, not files. You'll need some other process to extract the text from the file, and then run Lucene over that text.
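As a rough sketch of that second step, and of the metadata boosting the question asks about: once the text and metadata have been extracted, each piece can go into its own Lucene field, and a field can then be weighted more heavily at query time with `BoostQuery`. This assumes Lucene 8.x or 9.x (index-time field boosts were removed in Lucene 7, so the boost is applied to the query instead). The class name, the field names `path`, `title` and `contents`, the boost factor and the sample strings are all made up for illustration; in practice the strings would come from the extraction step.

```java
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.BoostQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

// Hypothetical example class; field names and boost value are assumptions.
public class MetadataBoostSketch {

    /** Adds one document: path is stored as-is, title and body are tokenized. */
    static void addDoc(IndexWriter writer, String path, String title, String body)
            throws IOException {
        Document doc = new Document();
        doc.add(new StringField("path", path, Field.Store.YES));
        doc.add(new TextField("title", title, Field.Store.YES));
        doc.add(new TextField("contents", body, Field.Store.NO));
        writer.addDocument(doc);
    }

    /** Matches the term in title or contents; title matches count three times as much. */
    static Query boostedQuery(String term) {
        // TermQuery is not analyzed, so the term must already be lowercase.
        Query inTitle = new BoostQuery(new TermQuery(new Term("title", term)), 3.0f);
        Query inBody = new TermQuery(new Term("contents", term));
        return new BooleanQuery.Builder()
                .add(inTitle, BooleanClause.Occur.SHOULD)
                .add(inBody, BooleanClause.Occur.SHOULD)
                .build();
    }

    /** Returns the stored path of the best hit, or null when nothing matches. */
    static String topPath(Directory dir, String term) throws IOException {
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            ScoreDoc[] hits = searcher.search(boostedQuery(term), 10).scoreDocs;
            return hits.length == 0 ? null : searcher.doc(hits[0].doc).get("path");
        }
    }

    /** Builds a tiny in-memory index with two made-up documents. */
    static Directory buildSampleIndex() throws IOException {
        Directory dir = new ByteBuffersDirectory();
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        try (IndexWriter writer = new IndexWriter(dir, config)) {
            addDoc(writer, "a.pdf", "Quarterly report",
                    "This file mentions lucene once in passing.");
            addDoc(writer, "b.pdf", "Lucene indexing guide",
                    "How to build an index from extracted text.");
        }
        return dir;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(topPath(buildSampleIndex(), "lucene"));
    }
}
```

Keeping the metadata in separate fields is what makes the boost possible: the same term can be scored differently depending on whether it appears in the title or in the body. For a persistent index, `FSDirectory.open(path)` can replace the in-memory `ByteBuffersDirectory`.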
