How to index PDF, PPT, and XLS files in Lucene (Java, Python, or PHP; any of these is fine)?

Question

I also want to know how to add metadata while indexing so that I can boost some parameters.

score 4 · Answer 1 · answered Apr 06 '10 at 07:56

There are several frameworks for extracting text suitable for Lucene indexing from rich document formats (PDF, PPT, etc.).

One of them is Apache Tika, which began as a sub-project of Lucene.
Apache POI is a more general Apache project for reading and writing Microsoft Office documents.
There are also some commercial alternatives.

Sergii Kabashniuk · Answer 2 · 2010-04-16T14:10:02.013

You can use Apache Tika. Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.

Supported Document Formats

HyperText Markup Language
XML and derived formats
Microsoft Office document formats
OpenDocument Format
Portable Document Format
Electronic Publication Format
Rich Text Format
Compression and packaging formats
Text formats
Audio formats
Image formats
Video formats
Java class files and archives
The mbox format

The code will look like this:

    Reader reader = new Tika().parse(stream);
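On the metadata and boosting part of the question: Tika returns metadata (title, author, and so on) alongside the text, and a common approach is to put each value in its own field and let matches in some fields count for more. The sketch below only illustrates that idea in plain Java. It is not Lucene's scoring formula, and the field names and the weight of 3.0 are assumptions made up for the example. How boosts are actually set in Lucene (at index time or at query time) depends on the Lucene version, so check the documentation for the release you use.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Conceptual sketch of per-field boosting; not Lucene's actual scoring formula.
class FieldBoostSketch {

    // Scores one document: each field that contains the term contributes its boost.
    static double score(Map<String, String> fields, Map<String, Double> boosts, String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        double total = 0.0;
        for (Map.Entry<String, String> field : fields.entrySet()) {
            if (field.getValue().toLowerCase(Locale.ROOT).contains(needle)) {
                // Fields without an explicit boost count with weight 1.0.
                total += boosts.getOrDefault(field.getKey(), 1.0);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical fields: "title" and "author" would come from extracted metadata.
        Map<String, String> doc = new HashMap<>();
        doc.put("title", "Lucene in practice");
        doc.put("author", "Jane Doe");
        doc.put("contents", "Notes on indexing rich documents with Lucene.");

        Map<String, Double> boosts = new HashMap<>();
        boosts.put("title", 3.0); // assumed weight: a title match counts three times as much

        System.out.println(score(doc, boosts, "lucene"));   // matches title and contents
        System.out.println(score(doc, boosts, "indexing")); // matches contents only
    }
}
```

A document whose title matches the query term ends up ahead of one that only mentions the term in the body, which is usually what boosting a metadata field is meant to achieve.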

score 1 · Answer 3 · answered May 12 '13 at 07:44

See https://github.com/WolfgangFahl/pdfindexer for a Java solution that uses PDFBox and Apache Lucene. It splits the PDF files into text page by page, indexes these text pages, and creates an HTML index file that links to the pages in the PDF sources by using a corresponding open parameter.
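To make the page-by-page idea concrete, here is a rough sketch in plain Java. It is not the pdfindexer code. It assumes the text of each page has already been extracted (for example with PDFBox), uses a simple map instead of Lucene, and assumes the open parameter in question is `#page=N`. The class and method names are made up for the example, and the PDF file name is not HTML-escaped to keep it short.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Sketch: build a term -> pages index and render it as HTML links into the PDF.
class PageIndexSketch {

    // Maps each term to the 1-based page numbers it occurs on.
    static Map<String, Set<Integer>> indexPages(List<String> pageTexts) {
        Map<String, Set<Integer>> index = new TreeMap<>();
        for (int i = 0; i < pageTexts.size(); i++) {
            String[] terms = pageTexts.get(i).toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
            for (String term : terms) {
                if (!term.isEmpty()) {
                    index.computeIfAbsent(term, k -> new TreeSet<>()).add(i + 1);
                }
            }
        }
        return index;
    }

    // Renders one list item per term, with one link per page (pdfName#page=N).
    static String toHtml(String pdfName, Map<String, Set<Integer>> index) {
        StringBuilder html = new StringBuilder("<ul>\n");
        for (Map.Entry<String, Set<Integer>> entry : index.entrySet()) {
            html.append("<li>").append(entry.getKey()).append(":");
            for (int page : entry.getValue()) {
                html.append(" <a href=\"").append(pdfName).append("#page=").append(page)
                    .append("\">").append(page).append("</a>");
            }
            html.append("</li>\n");
        }
        return html.append("</ul>\n").toString();
    }

    public static void main(String[] args) {
        // Hypothetical page texts standing in for PDFBox output.
        List<String> pages = Arrays.asList("Introduction to indexing", "Indexing PDF pages");
        System.out.println(toHtml("manual.pdf", indexPages(pages)));
    }
}
```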

score 1 · Accepted Answer · answered Apr 06 '10 at 06:11


Lucene indexes text, not files. You'll need some other process to extract the text from each file, and then run Lucene over that text.
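The two steps can be sketched like this. It is a conceptual example in plain Java, not real Lucene code: the tiny inverted index stands in for Lucene, and the extractor is a hypothetical function that in practice would wrap a library such as Tika, POI, or PDFBox. All names are invented for the illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Function;

// Conceptual sketch: step 1 turns a file into text, step 2 indexes only that text.
class ExtractThenIndex {

    // Toy inverted index (term -> ids of documents containing it); a stand-in for Lucene.
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Step 2: the index sees plain text only, never the original file bytes.
    void addDocument(String docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Returns the ids of all documents that contain the term (case-insensitive).
    Set<String> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    // Step 1 is pluggable: the extractor converts raw file bytes to text.
    void indexFile(String docId, byte[] fileBytes, Function<byte[], String> extractor) {
        addDocument(docId, extractor.apply(fileBytes));
    }

    public static void main(String[] args) {
        ExtractThenIndex index = new ExtractThenIndex();
        // Hypothetical extractor: pretends the file is already UTF-8 text.
        Function<byte[], String> plainText = bytes -> new String(bytes, StandardCharsets.UTF_8);
        byte[] fakeFile = "Quarterly sales report".getBytes(StandardCharsets.UTF_8);
        index.indexFile("report.pdf", fakeFile, plainText);
        System.out.println(index.search("sales"));
    }
}
```

Keeping the extractor separate means a new file format only needs a new extractor; the indexing side does not change.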

Michael Shimmins

