Solution to convert PDFs, DOCs, DOCXs into a textual format with Python

Question

I am developing a full-text search engine for indexing popular binary formats. I know that there are hundreds of such questions (and solutions) already, but I found it hard to find one that meets all of these requirements:

cross platform
supports DOC, DOCX and PDF formats at once
easy to use with Python
can be set up on a major shared host

codeape · Accepted Answer · 2011-07-28T10:40:08.073

1

For PDFs, I recommend PDFMiner.
For DOCX, try the docx module (I have not used it myself).
I am not aware of any pure Python module that can read .doc files.
There are command-line tools that extract text from .doc files: antiword and catdoc (and probably others). If the packages are installed on your shared host, you could use subprocess to shell out to these tools. Both are also available on Windows via Cygwin.
Apache POI is a Java library that can extract text from Office documents. If your shared host has Java installed, you could write a bit of Java (or Jython) code and run it using subprocess.
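
To illustrate the subprocess approach for .doc files, here is a minimal sketch. It assumes antiword or catdoc is on the PATH and that either tool writes the extracted text to standard output when called with just the file name. The helper names find_doc_tool and doc_to_text are made up for this example, and the output encoding is an assumption that depends on how the tool is configured on your host.

```python
import shutil
import subprocess


def find_doc_tool(candidates=("antiword", "catdoc")):
    """Return the name of the first candidate tool found on PATH, or None."""
    for name in candidates:
        if shutil.which(name):
            return name
    return None


def doc_to_text(path, tool=None, timeout=60):
    """Run an external extractor on `path` and return its stdout as text.

    `tool` is the executable to call; if omitted, antiword and catdoc
    are tried in that order.
    """
    tool = tool or find_doc_tool()
    if tool is None:
        raise RuntimeError("neither antiword nor catdoc was found on PATH")
    # check=True raises CalledProcessError if the tool exits with an error.
    result = subprocess.run(
        [tool, path],
        capture_output=True,
        check=True,
        timeout=timeout,
    )
    # Assumption: the tool emits UTF-8; undecodable bytes are replaced.
    return result.stdout.decode("utf-8", errors="replace")
```

Calling doc_to_text("report.doc") would then return the text of a (hypothetical) report.doc, or raise if no tool is installed or the tool fails.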

edited Jul 28 '11 at 10:40

answered Jul 28 '11 at 07:41

According to an edit suggestion, the author used the docx module. – Tomas Aschan Jul 28 '11 at 15:23

score 0 · Answer 2 · answered Aug 15 '14 at 12:49

0

textract offers a single Python interface that delegates to the usual extraction tool for each file type (for example, antiword for .doc files).

https://github.com/deanmalmgren/textract
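
A minimal sketch of how this could look, assuming textract and the system tools it depends on are installed. textract.process returns bytes, so the result is decoded here; the wrapper name extract_text and the UTF-8 default are choices made for this example only.

```python
def extract_text(path, encoding="utf-8"):
    """Extract text from a PDF, DOC or DOCX file via textract."""
    # Imported lazily so this module still loads where textract is absent.
    import textract  # third-party package: pip install textract

    raw = textract.process(path)  # picks a parser from the file extension
    # Assumption: the bytes are UTF-8; undecodable bytes are replaced.
    return raw.decode(encoding, errors="replace")
```

The same call would then cover all three formats from the question, for example extract_text("manual.pdf") or extract_text("letter.docx") (hypothetical file names).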

enthus1ast

score 0 · Answer 3 · answered Jul 28 '11 at 08:18

0

If you can run OpenOffice on the server, you can use unoconv, which converts between any document formats supported by OpenOffice.

Michał Niklas

score 0 · Answer 4 · answered Jul 28 '11 at 12:13

One possible solution is to use Google Docs to extract the text contents from binary .doc files. You upload the document to Google Docs and then download the text contents. It is a fairly slow process, but it is the only "pure Python" solution I know of, since it requires no external tools, only network access. An external tool such as catdoc or antiword is a much better solution if you are allowed to install it on your host.
