Best way to extract text from a Word doc without using COM/automation?

Question

Is there a reasonable way to extract plain text from a Word file that doesn't depend on COM automation? (This is a feature for a web app deployed on a non-Windows platform; that's non-negotiable in this case.)

Antiword looks like a reasonable option, but it may be abandoned.

A Python solution would be ideal, but doesn't appear to be available.

mikemaccana · Answer 1 · 2020-12-21T12:54:31.730

21

(Same answer as extracting text from MS word files in python)

Use the native Python docx module which I made this week. Here's how to extract all the text from a doc:

document = opendocx('Hello world.docx')

# This location is where most document content lives 
docbody = document.xpath('/w:document/w:body', namespaces=wordnamespaces)[0]

# Extract all text
print(getdocumenttext(document))

See Python DocX site

100% Python, no COM, no .NET, no Java, no parsing serialized XML with regexes.

edited Dec 21 '20 at 12:54

answered Dec 30 '09 at 12:23

mikemaccana


Thank you very much for creating this library. I know you posted this 3 years ago, but is there any way to convert a DOCX document to HTML using your library? Cheers – Bo Milanovich Apr 20 '12 at 17:44
@mikemaccana can it also parse .doc (not .docx) files? – ofnowhere Jun 19 '14 at 14:33
Please ask about .doc files as a separate question. – mikemaccana Dec 21 '20 at 12:55

score 16 · Accepted Answer · answered Sep 04 '08 at 08:52

I use catdoc or antiword for this, whichever gives the result that is easiest to parse. I have wrapped this in Python functions, so it is easy to use from the parsing system (which is written in Python).

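import subprocess

def doc_to_text_catdoc(filename):
    # os.popen3 was removed in Python 3; subprocess.Popen is its replacement.
    # Passing a list (no shell) also avoids quoting problems in filenames.
    proc = subprocess.Popen(['catdoc', '-w', filename],
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                            universal_newlines=True)
    retval, erroroutput = proc.communicate()
    if not erroroutput:
        return retval
    else:
        raise OSError("Executing the command caused an error: %s" % erroroutput)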

# similar doc_to_text_antiword()

The -w switch to catdoc turns off line wrapping, BTW.
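
The `# similar doc_to_text_antiword()` placeholder above could be filled in along these lines. This is only a sketch: `run_extractor` is a hypothetical helper name, and the code assumes the `antiword` binary is installed and on the PATH. Unlike the catdoc function, it checks the exit status rather than whether anything was written to stderr; that is a choice made for this sketch, not something the answer specifies.

```python
import subprocess


def run_extractor(argv):
    """Run an external text extractor and return what it wrote to stdout.

    `argv` is a list such as ['antiword', 'file.doc']; no shell is involved,
    so filenames with spaces or quotes need no escaping.
    """
    result = subprocess.run(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode the output bytes to str
    )
    if result.returncode != 0:
        raise OSError("Command %r failed: %s" % (argv, result.stderr))
    return result.stdout


def doc_to_text_antiword(filename):
    # Assumption: the antiword binary is installed and on PATH.
    return run_extractor(['antiword', filename])
```

Keeping the process handling in one helper means a catdoc variant only needs a different argument list, for example `run_extractor(['catdoc', '-w', filename])`.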

codeape


Note that python 3 removes popen3, see https://docs.python.org/3/library/subprocess.html#replacing-os-popen-os-popen2-os-popen3 – codeape Jul 04 '16 at 13:56
Worth noting that antiword doesn't work on the most recent docx formats. Their website states they support "Word 2, 6, 7, 97, 2000, 2002 and 2003" – Mr. T Aug 02 '21 at 08:51

Etienne · Answer 3 · 2014-01-15T19:13:22.637

If all you want to do is extract text from Word files (.docx), it's possible to do it with Python alone. As Guy Starbuck wrote, you just need to unzip the file and then parse the XML. Inspired by python-docx, I have written a simple function to do this:

try:
    from xml.etree.cElementTree import XML
except ImportError:
    from xml.etree.ElementTree import XML
import zipfile


"""
Module that extracts text from MS XML Word documents (.docx).
(Inspired by python-docx <https://github.com/mikemaccana/python-docx>)
"""

WORD_NAMESPACE = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
PARA = WORD_NAMESPACE + 'p'
TEXT = WORD_NAMESPACE + 't'


def get_docx_text(path):
    """
    Take the path of a docx file as argument, return the text in unicode.
    """
    document = zipfile.ZipFile(path)
    xml_content = document.read('word/document.xml')
    document.close()
    tree = XML(xml_content)

    paragraphs = []
    for paragraph in tree.getiterator(PARA):
        texts = [node.text
                 for node in paragraph.getiterator(TEXT)
                 if node.text]
        if texts:
            paragraphs.append(''.join(texts))

    return '\n\n'.join(paragraphs)

Great piece of code! A small remark about your blog: the code would be more readable if its background were not black. – Jean-Francois T., Nov 24 '15 at 05:58
Oh, thanks for the comment. The problem is that I 'hacked' the GitHub CSS a bit so the colors match my site. But whenever GitHub changes its CSS, I have to patch my stylesheet again, like right now. Not sure I'll keep this approach... – Etienne, Nov 24 '15 at 17:46

score 3 · Answer 4 · answered Sep 04 '08 at 07:45

Using the OpenOffice API, and Python, and Andrew Pitonyak's excellent online macro book I managed to do this. Section 7.16.4 is the place to start.

One other tip to make it work without needing the screen at all is to use the Hidden property:

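from com.sun.star.beans import PropertyValue

# `desktop` is an already-connected Desktop object; `docpath` is the document URL
RO = PropertyValue('ReadOnly', 0, True, 0)
Hidden = PropertyValue('Hidden', 0, True, 0)
xDoc = desktop.loadComponentFromURL(docpath, "_blank", 0, (RO, Hidden))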

Otherwise the document flicks up on the screen (probably on the webserver console) when you open it.

score 2 · Answer 5 · answered Aug 18 '18 at 05:32

tika-python

A Python port of the Apache Tika library that makes Tika available through the Tika REST server, so a Java runtime is still required. According to the documentation, Apache Tika supports text extraction from over 1500 file formats.

Note: It also works charmingly with pyinstaller

Install with pip:

pip install tika

Sample:

#!/usr/bin/env python
from tika import parser
parsed = parser.from_file('/path/to/file')
print(parsed["metadata"])  # To get the metadata of the file
print(parsed["content"])  # To get the content of the file

Link to official GitHub

Dhinesh kumar M


I tried your example and it seems it tries to download and start a Java `.jar` file: " Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.18/tika-server-1.18.jar" <-- but then it fails with HTTP 403. – Prof. Falken Aug 21 '18 at 15:02
Follow these steps: 1. Manually download Tika from [here](http://www.apache.org/dyn/closer.cgi/tika/tika-server-1.18.jar). 2. Open tika.py in the \Lib\site-packages\tika folder and, in the line TikaJarPath = os.getenv('TIKA_PATH', "path\to\tika-server.jar\folder"), replace the default with the folder that holds the jar, for example `TikaJarPath = os.getenv('TIKA_PATH', "F:\Projects\python\tika")` – Dhinesh kumar M Aug 31 '18 at 13:03
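
Instead of editing tika.py inside site-packages, the same setting can probably be supplied through the environment, since the line quoted in the comment above reads `TIKA_PATH` with `os.getenv`. A minimal sketch, assuming the jar has already been downloaded; the directory `/opt/tika` and both helper names are placeholders, and the variable has to be set before `tika` is first imported:

```python
import os


def configure_tika(jar_dir):
    """Point tika-python at a folder that already holds the tika-server jar."""
    # Must run before the first `import tika`: the os.getenv line quoted above
    # is evaluated when the library's module is loaded.
    os.environ['TIKA_PATH'] = jar_dir


def extract_text(path):
    # Imported lazily so that configure_tika() can run first.
    from tika import parser
    return parser.from_file(path)["content"]


configure_tika('/opt/tika')  # placeholder directory
```

This keeps the installed package untouched, so the setting survives a reinstall or upgrade of tika.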

score 1 · Answer 6 · 2009-09-07T14:55:20.570

For docx files, check out the Python script docx2txt available at

http://cobweb.ecn.purdue.edu/~kak/distMisc/docx2txt

for extracting the plain text from a docx document.

edited Sep 07 '09 at 14:55

answered Sep 06 '09 at 23:44

score 1 · Answer 7 · edited May 23 '17 at 12:02

This worked well for .doc and .odt.

It calls openoffice on the command line to convert your file to text, which you can then simply load into python.

(It seems to have other format options, though they are apparently not documented.)
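
A similar conversion can be done with LibreOffice's own command line. The following is a sketch under stated assumptions, not the tool this answer used: it assumes a `soffice` binary on the PATH, the helper names are hypothetical, `txt:Text` is the filter name commonly used for plain-text output, and UTF-8 output is assumed when reading the result back.

```python
import os
import subprocess


def build_convert_cmd(src, outdir):
    """Return the argv for a headless LibreOffice conversion to plain text."""
    return ['soffice', '--headless', '--convert-to', 'txt:Text',
            '--outdir', outdir, src]


def office_to_text(src, outdir):
    # Assumption: `soffice` (LibreOffice) is installed and on PATH.
    subprocess.check_call(build_convert_cmd(src, outdir))
    # The converted file is written to outdir as <basename>.txt.
    base = os.path.splitext(os.path.basename(src))[0]
    # Assumption: the output is UTF-8; this may depend on filter options.
    with open(os.path.join(outdir, base + '.txt'), encoding='utf-8') as fh:
        return fh.read()
```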

edited May 23 '17 at 12:02

Community


answered May 08 '15 at 11:23

markling


OpenOffice and LibreOffice are quite bad at dealing with MS formats. – Tedo Vrbanec Mar 31 '19 at 18:42

score 1 · Answer 8 · answered Sep 03 '08 at 20:20

Open Office has an API

Unsliced


score 0 · Answer 9 · answered Jul 18 '20 at 22:34

Honestly, don't use "pip install tika"; it was developed for a single user (one developer working on a laptop) and not for multi-user (multi-developer) setups.

The small TikaWrapper.py class below, which runs Tika from the command line, is more than enough to meet our needs.

You just have to instantiate this class with the JAVA_HOME path and the Tika jar path, that's all! And it works perfectly for a lot of formats (e.g. PDF, DOCX, ODT, XLSX, PPT, etc.).

#!/bin/python
# -*- coding: utf-8 -*-

# Class to extract metadata and text from different file types (such as PPT, XLS, and PDF)
# Developed by Philippe ROSSIGNOL
#####################
# TikaWrapper class #
#####################
class TikaWrapper:

    java_home = None
    tika_lib_path = None

    # Constructor
    def __init__(self, java_home, tikalib_path):
        self.java_home = java_home
        self.tika_lib_path = tikalib_path

    def extractMetadata(self, filePath, encoding="UTF-8", returnTuple=False):
        '''
        - Description:
          Extract metadata from a document
        
        - Params:
          filePath: The document file path
          encoding: The encoding (default = "UTF-8")
          returnTuple: If True return a tuple which contains both the output and the error (default = False)
        
        - Examples:
          metadata = extractMetadata(filePath="MyDocument.docx")
          metadata, error = extractMetadata(filePath="MyDocument.docx", encoding="UTF-8", returnTuple=True)
        '''
        cmd = self._getCmd(self._cmdExtractMetadata, filePath, encoding)
        out, err = self._execute(cmd, encoding)
        if (returnTuple): return out, err
        return out

    def extractText(self, filePath, encoding="UTF-8", returnTuple=False):
        '''
        - Description:
          Extract text from a document
        
        - Params:
          filePath: The document file path
          encoding: The encoding (default = "UTF-8")
          returnTuple: If True return a tuple which contains both the output and the error (default = False)
        
        - Examples:
          text = extractText(filePath="MyDocument.docx")
          text, error = extractText(filePath="MyDocument.docx", encoding="UTF-8", returnTuple=True)
        '''
        cmd = self._getCmd(self._cmdExtractText, filePath, encoding)
        out, err = self._execute(cmd, encoding)
        if (returnTuple): return out, err
        return out

    # ===========
    # = PRIVATE =
    # ===========

    _cmdExtractMetadata = "${JAVA_HOME}/bin/java -jar ${TIKALIB_PATH} --metadata ${FILE_PATH}"
    _cmdExtractText = "${JAVA_HOME}/bin/java -jar ${TIKALIB_PATH} --encoding=${ENCODING} --text ${FILE_PATH}"

    def _getCmd(self, cmdModel, filePath, encoding):
        cmd = cmdModel.replace("${JAVA_HOME}", self.java_home)
        cmd = cmd.replace("${TIKALIB_PATH}", self.tika_lib_path)
        cmd = cmd.replace("${ENCODING}", encoding)
        cmd = cmd.replace("${FILE_PATH}", filePath)
        return cmd

    def _execute(self, cmd, encoding):
        import subprocess
        process = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        out, err = process.communicate()
        out = out.decode(encoding=encoding)
        err = err.decode(encoding=encoding)
        return out, err
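
One thing to watch in `_execute` above: it passes a single command string with `shell=True`, so a file path containing spaces or shell metacharacters can break or alter the command. A hedged alternative is to build an argument list and skip the shell. The function names here are hypothetical; the `--encoding=` and `--text` flags are the ones the class above already uses.

```python
import os
import subprocess


def build_tika_cmd(java_home, tika_jar, file_path, encoding="UTF-8"):
    """Build the Tika text-extraction command as an argument list."""
    return [
        os.path.join(java_home, "bin", "java"),
        "-jar", tika_jar,
        "--encoding=" + encoding,
        "--text",
        file_path,
    ]


def extract_text(java_home, tika_jar, file_path, encoding="UTF-8"):
    # No shell=True: the list is handed to the OS as-is, so spaces or shell
    # metacharacters in file_path cannot change the command.
    cmd = build_tika_cmd(java_home, tika_jar, file_path, encoding)
    process = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = process.communicate()
    return out.decode(encoding), err.decode(encoding)
```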

score 0 · Answer 10 · answered Sep 08 '20 at 08:42

In case someone wants to do this in Java, there is the Apache POI API. extractor.getText() will extract plain text from a docx file. Here is the link: https://www.tutorialspoint.com/apache_poi_word/apache_poi_word_text_extraction.htm

Vishal Sanap


score 0 · Answer 11 · answered Jan 22 '22 at 20:21

Textract-Plus

Use textract-plus, which can extract text from most document extensions, including doc, docm, dotx and docx. (It uses antiword as a backend for doc files.) Refer to the docs.

Install-

pip install textract-plus

Sample-

import textractplus as tp
text=tp.process('path/to/yourfile.doc')

score 0 · Answer 12 · answered Apr 21 '22 at 02:14

There is also pandoc, the Swiss Army knife of documents. It converts from many formats to nearly every other format. From the demos page:

pandoc -s input_file.docx -o output_file.txt
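
For a web app, the same conversion can be driven from Python. This is a sketch, assuming the `pandoc` binary is installed and on the PATH; the helper names are hypothetical. It adds `-t plain` to request pandoc's plain-text writer explicitly, because relying on the `.txt` extension alone may not select plain text.

```python
import subprocess


def build_pandoc_cmd(src, dest):
    """Return the argv for converting `src` to plain text with pandoc."""
    # -t plain requests the plain-text writer explicitly instead of relying on
    # pandoc guessing the output format from the file extension.
    return ['pandoc', '-s', src, '-t', 'plain', '-o', dest]


def docx_to_text(src, dest):
    # Assumption: the pandoc binary is installed and on PATH.
    subprocess.check_call(build_pandoc_cmd(src, dest))
    with open(dest, encoding='utf-8') as fh:
        return fh.read()
```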

CpILL


score 0 · Answer 13 · edited Jun 10 '22 at 14:14

Like Etienne's answer, but updated for newer Python: getiterator was removed from ElementTree in Python 3.9 (after a long deprecation), so you need to replace it with iter:


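import zipfile
from xml.etree.ElementTree import XML

WORD_NAMESPACE = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
PARA = WORD_NAMESPACE + 'p'
TEXT = WORD_NAMESPACE + 't'


def get_docx_text(path):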
    """
    Take the path of a docx file as argument, return the text in unicode.
    """
    document = zipfile.ZipFile(path)
    xml_content = document.read('word/document.xml')
    document.close()
    tree = XML(xml_content)

    paragraphs = []
    for paragraph in tree.iter(PARA):
        texts = [node.text
                 for node in paragraph.iter(TEXT)
                 if node.text]
        if texts:
            paragraphs.append(''.join(texts))

    return '\n\n'.join(paragraphs)
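
One limitation of the function above is that it only collects `w:t` nodes, so tabs (`w:tab`) and manual line breaks (`w:br`) inside a run are dropped. The variant below is a sketch of one way to keep them; the name `get_docx_text_with_breaks` is made up for this example. It looks only at the direct children of each run, so tab-stop definitions in the paragraph properties are not mistaken for tab characters.

```python
import zipfile
from xml.etree.ElementTree import XML

W = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'


def get_docx_text_with_breaks(path_or_file):
    """Like get_docx_text, but keeps tabs and manual line breaks."""
    with zipfile.ZipFile(path_or_file) as document:
        tree = XML(document.read('word/document.xml'))

    paragraphs = []
    for paragraph in tree.iter(W + 'p'):
        parts = []
        for run in paragraph.iter(W + 'r'):
            for node in run:  # direct children of the run only
                if node.tag == W + 't' and node.text:
                    parts.append(node.text)
                elif node.tag == W + 'tab':
                    parts.append('\t')
                elif node.tag == W + 'br':
                    parts.append('\n')
        if parts:
            paragraphs.append(''.join(parts))

    return '\n\n'.join(paragraphs)
```

Because `zipfile.ZipFile` accepts file-like objects as well as paths, the function can be given an uploaded file held in memory without writing it to disk first.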
