2

My end goal is to index documents using Lucene. Since Lucene doesn't support indexing other formats, I want to convert these files to TXT/HTML (file types Lucene can index). I have a set of almost 1000 documents in PPT, PDF, DOC, XLS, and similar formats. Please help.

harsha
    I believe this is a duplicate of http://stackoverflow.com/questions/2582951/how-to-index-pdf-ppt-xl-files-in-lucene-java-based-or-python-or-php-any-of-the . Please see my answer to that question. – Yuval F Apr 14 '10 at 09:39

2 Answers

1

You could run OpenOffice in headless mode to convert the files from one format to another, say Excel/Doc to TXT/HTML.

We use a similar process combined with ImageMagick to allow people to upload office documents into a presentation app.

Below are a few examples/tutorials on how to achieve this:

Set up OpenOffice

http://code.google.com/p/openmeetings/wiki/OpenOfficeConverter

JOD Converter (Java)

http://artofsolving.com/opensource/jodconverter

PyOD Converter (Python)

http://artofsolving.com/opensource/pyodconverter
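For a rough idea of what a headless batch conversion could look like without either wrapper, here is a minimal sketch in Python. It assumes a LibreOffice-style `soffice` binary on the PATH that accepts the `--headless`, `--convert-to` and `--outdir` options; older OpenOffice builds may not accept these, in which case one of the converters above is the safer route. The function names and the extension list are made up for illustration, and PDFs are not handled here.

```python
import subprocess
from pathlib import Path

# Extensions to convert; an assumption, adjust to match your document set.
OFFICE_EXTENSIONS = {".doc", ".docx", ".ppt", ".pptx", ".xls", ".xlsx", ".odt"}


def build_command(source, out_dir, target="html", soffice="soffice"):
    """Return the argument list for one headless conversion."""
    return [
        soffice,
        "--headless",
        "--convert-to", target,
        "--outdir", str(out_dir),
        str(source),
    ]


def find_documents(root):
    """List files under root whose extension is in OFFICE_EXTENSIONS."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in OFFICE_EXTENSIONS
    )


def convert_all(root, out_dir, target="html"):
    """Convert every matching file; return the paths whose command failed."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    failed = []
    for doc in find_documents(root):
        result = subprocess.run(build_command(doc, out_dir, target))
        # The exit code is only a rough signal; also check the output files.
        if result.returncode != 0:
            failed.append(doc)
    return failed
```

Calling `convert_all("docs", "converted")` would write one HTML file per input into `converted`, which you can then feed to Lucene. Files that share a base name (say `report.doc` and `report.ppt`) would overwrite each other in a single output directory, so you may want one output directory per source folder.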

If you need any further help with OOo, feel free to ask.

Good luck :)

jahilldev
-1

As of 2022, there is an open-source Python tool that does this: https://github.com/shakiyam/pptx2txt

Uriel