4

Also, I want to know how to add metadata while indexing so that I can boost some parameters.

harsha

4 Answers

4

There are several frameworks for extracting text suitable for Lucene indexing from rich-text files (PDF, PPT, etc.):

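  • One of them is Apache Tika, which began as a sub-project of Lucene and is now a top-level Apache project.
  • Apache POI is a more general Apache library for reading and writing Microsoft Office documents; it can also extract their text.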
  • There are also some commercial alternatives.
Yuval F
3

You can use Apache Tika. Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.

Supported Document Formats

  • HyperText Markup Language
  • XML and derived formats
  • Microsoft Office document formats
  • OpenDocument Format
  • Portable Document Format
  • Electronic Publication Format
  • Rich Text Format
  • Compression and packaging formats
  • Text formats
  • Audio formats
  • Image formats
  • Video formats
  • Java class files and archives
  • The mbox format

The code will look like this, where stream is an InputStream opened on the file:

    Reader reader = new Tika().parse(stream);

1

See https://github.com/WolfgangFahl/pdfindexer for a Java solution that uses PDFBox and Apache Lucene. It splits the PDF files into text page by page, indexes these text pages, and creates an HTML index file that links to the pages in the PDF sources by using a corresponding open parameter.

Wolfgang Fahl
1

Lucene indexes text, not files. You'll need some other process to extract the text from each file, and then you run Lucene over that text.
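
To make the split between extraction and indexing concrete, here is a minimal plain-Java sketch. It is not Lucene code: ToyFieldIndex, its methods, and the field names "title" and "content" are all hypothetical, and the scoring is a simple term count rather than Lucene's ranking. It only illustrates that the index receives named text fields (for example body text plus metadata such as a title) and that a per-field boost can change which document ranks first.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Toy stand-in for a search index: it only ever sees named text fields, never files. */
class ToyFieldIndex {
    // One entry per document: field name -> lower-cased field text.
    private final List<Map<String, String>> docs = new ArrayList<>();

    /** Adds a document made of already-extracted text fields and returns its id. */
    int add(Map<String, String> fields) {
        Map<String, String> normalized = new HashMap<>();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            normalized.put(e.getKey(), e.getValue().toLowerCase(Locale.ROOT));
        }
        docs.add(normalized);
        return docs.size() - 1;
    }

    /** Scores one document: occurrences of the term in each field times that field's boost. */
    double score(int docId, String term, Map<String, Double> boosts) {
        String needle = term.toLowerCase(Locale.ROOT);
        double total = 0.0;
        for (Map.Entry<String, String> e : docs.get(docId).entrySet()) {
            int hits = 0;
            for (String token : e.getValue().split("\\W+")) {
                if (token.equals(needle)) {
                    hits++;
                }
            }
            // Fields without an explicit boost count with weight 1.0.
            total += hits * boosts.getOrDefault(e.getKey(), 1.0);
        }
        return total;
    }

    public static void main(String[] args) {
        ToyFieldIndex index = new ToyFieldIndex();
        // Pretend these strings were extracted from a PDF and a PPT by a separate tool.
        int pdf = index.add(Map.of(
                "title", "Lucene basics",
                "content", "Indexing text with an analyzer"));
        int ppt = index.add(Map.of(
                "title", "Quarterly review",
                "content", "We evaluated Lucene and Lucene forks"));

        // Title matches count three times as much as matches in other fields.
        Map<String, Double> boosts = Map.of("title", 3.0);
        System.out.println("pdf score: " + index.score(pdf, "lucene", boosts));
        System.out.println("ppt score: " + index.score(ppt, "lucene", boosts));
    }
}
```

With the title boost, the document whose title contains the term outranks the one that only mentions it twice in the body; without the boost the order flips. In Lucene itself the same idea is usually expressed by storing metadata in separate fields of each Document and weighting those fields when you build the query.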

Michael Shimmins