0

I am developing a full-text search engine for indexing popular binary formats. I know there are hundreds of such questions (and solutions) already, but I have found it hard to find a text-extraction solution that meets all of these requirements:

  • cross platform
  • supports DOC, DOCX and PDF formats at once
  • easy to use with python
  • can be set up in a major shared host
Jesvin Jose

4 Answers

1
  • For PDFs, I recommend PDFMiner.
  • Try the docx module for .docx files (I have not used it myself).
  • I am not aware of any pure Python module that can read .doc files.
  • There are command-line tools that extract text from .doc files: antiword and catdoc (and probably others). If these packages are installed on your shared host, you can use subprocess to shell out to them. Both are also available on Windows via Cygwin.
  • Apache POI is a Java library that can extract text from Office documents. If your shared host has Java installed, you could write a bit of Java (or Jython) code and execute using subprocess.
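
As a rough illustration of the subprocess approach, the sketch below shells out to antiword and returns whatever it prints. The function name `extract_doc_text` and its `tool` parameter are made up for this example, and it assumes the tool takes the file path as its only argument and writes the text to standard output.

```python
import shutil
import subprocess


def extract_doc_text(path, tool="antiword"):
    """Return the text of a .doc file by shelling out to an external tool.

    Assumption: `tool` (antiword by default) accepts the file path as its
    only argument and writes the extracted text to stdout.
    """
    # Fail early with a clear message if the tool is missing on this host.
    if shutil.which(tool) is None:
        raise FileNotFoundError(f"{tool!r} is not installed or not on PATH")

    # check=True raises CalledProcessError if the tool exits with an error.
    result = subprocess.run([tool, path], capture_output=True, check=True)

    # The output encoding depends on the tool and locale; UTF-8 is a guess.
    return result.stdout.decode("utf-8", errors="replace")
```

Calling `extract_doc_text("report.doc")` would then return the document text as a string, and passing `tool="catdoc"` should work the same way if catdoc is what your host provides.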
codeape
0

Textract wraps the usual extraction tool for each file type behind a single Python interface.

https://github.com/deanmalmgren/textract

enthus1ast
0

If you can use OpenOffice on the server, you can use unoconv, which converts between any document formats supported by OpenOffice.
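
A minimal sketch of driving unoconv from Python follows. The helper names `build_unoconv_command` and `convert_to_text` are hypothetical, and the sketch assumes your installed unoconv supports the `-f` (output format) and `--stdout` options and that OpenOffice is available for it to use.

```python
import subprocess


def build_unoconv_command(path, fmt="txt"):
    """Build the unoconv command line (assumes -f and --stdout are supported)."""
    return ["unoconv", "-f", fmt, "--stdout", path]


def convert_to_text(path):
    """Convert a document to plain text by running unoconv as a subprocess."""
    # check=True raises CalledProcessError if the conversion fails.
    result = subprocess.run(
        build_unoconv_command(path), capture_output=True, check=True
    )
    # The output encoding is an assumption; adjust it for your setup.
    return result.stdout.decode("utf-8", errors="replace")
```

Keeping the command construction in its own function makes it easy to check or adjust the flags without actually launching OpenOffice.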

Michał Niklas
0

One possible solution is to use Google Docs to extract the text from binary .doc files: you upload the document to Google Docs and then download its text content. It is a fairly slow process, but it is the only "pure Python" solution I know of, since it requires no external tools, only network access. An external tool such as catdoc or antiword is a much better solution if you are allowed to install it on your host.

Björn Lindqvist