-2

Possible Duplicate:
solution to convert PDFs, DOCs, DOCXs into a textual format with python

I am building a document search engine that indexes popular binary formats, and I am looking for Python libraries for this purpose.

Reliable converters have proved hard to find, and PyPDF has not extracted text accurately for me. Please recommend:

  • Python libraries that convert these formats to text
  • or cross-platform, standalone programs that can be called as a subprocess
Jesvin Jose

2 Answers

1
Katriel
1

You might try OpenOffice.

Its conversion quality is above average. For PDF documents, you need to install the PDF Import extension.

There are some extensions for driving it from Python, such as the python-uno bridge, but I've had difficulty with it and generally resort to calling OpenOffice as a subprocess.
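The subprocess approach might look roughly like the sketch below. It assumes a LibreOffice-style `soffice` binary on the PATH that accepts `--headless`, `--convert-to` and `--outdir`; older OpenOffice builds may not support these options. The helper names `build_convert_command` and `convert_to_text` are made up for illustration, and the `txt:Text` filter is aimed at DOC/DOCX input. Whether PDFs convert this way depends on the installation, so treat that part as untested.

```python
import subprocess
from pathlib import Path


def build_convert_command(source, outdir, soffice="soffice"):
    """Build the argument list for a headless conversion to plain text.

    Assumption: `soffice` is a LibreOffice-style binary that understands
    --headless, --convert-to and --outdir.
    """
    return [
        soffice,
        "--headless",
        "--convert-to", "txt:Text",
        "--outdir", str(outdir),
        str(source),
    ]


def convert_to_text(source, outdir, soffice="soffice", timeout=120):
    """Convert one document and return the extracted text (hypothetical helper)."""
    source = Path(source)
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)

    # check=True raises CalledProcessError if the converter exits non-zero;
    # timeout guards against the office process hanging on a bad file.
    subprocess.run(
        build_convert_command(source, outdir, soffice),
        check=True,
        timeout=timeout,
    )

    # The converter writes <original stem>.txt into the output directory.
    target = outdir / (source.stem + ".txt")
    # The output encoding can vary, so undecodable bytes are replaced.
    return target.read_text(encoding="utf-8", errors="replace")
```

Calling `convert_to_text("report.docx", "out")` would then return the text for indexing. Keeping the command construction in its own function makes it easy to swap in a different binary path per platform.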

Just noticed you opened a duplicate question at: solution to convert PDFs, DOCs, DOCXs into a textual format with python...

Adam Morris