How can you get the text from a docx file in Python? Preferably, this would load it into a plain string. Formatting in the original file can be ignored.

I understand the structure of a docx file (a zip archive in which the text is saved as word/document.xml), but I would like a simple way of extracting the text without having to manually unpack the archive, pull out that file, and parse the paragraph tags.

I have tried Python Docx (as per this old Stack Overflow question), but I get an error every time:

import docx as dx
document = dx.opendocx('files/file.docx')

Traceback (most recent call last):
  File "concord.py", line 2, in <module>
    document = dx.opendocx('files/#n01 ch B3A126.docx')
AttributeError: 'module' object has no attribute 'opendocx'
  • Do you happen to have a file named `docx.py` in the current directory? – Tim Pietzcker Sep 22 '12 at 16:18
  • No I don't have `docx.py` in the current working directory. However, there is such a file in `Python Docx` github release. To install it, all I did was extract it to a random folder (which I since deleted) and ran `python setup.py install`. Hope that's ok? – Zach Sep 22 '12 at 16:22
  • What do you get if you put `dir(dx)` right after the import? – Tim Pietzcker Sep 22 '12 at 16:25
  • If I do it in iPython I get: `Out[2]: ['AdvSearch', 'Image', '__builtins__', '__doc__', '__file__', '__name__', '__package__', 'advReplace', 'appproperties', 'clean', 'contenttypes', 'coreproperties', 'etree', 'findTypeParent', 'getdocumenttext', 'heading', 'join', 'log', 'logging', 'makeelement', 'newdocument', 'nsprefixes', 'opendocx', 'os', 'pagebreak', 'paragraph', 'picture', 're', 'relationshiplist', 'replace', 'savedocx', 'search', 'shutil', 'table', 'template_dir', 'time', 'websettings', 'wordrelationships', 'zipfile'] ` – Zach Sep 22 '12 at 16:31
  • If I do it as I originally was, in a text editor (notepad++) and run the file from the command line using `python filename.py`, I get `['__builtins__', '__doc__', '__file__', '__name__', '__package__', '__path__']`.. Not sure why I get different results running python and running it directly in iPython.. – Zach Sep 22 '12 at 16:34
  • Have a look at [this](http://davidmburke.com/2014/02/04/python-convert-documents-doc-docx-odt-pdf-to-plain-text-without-libreoffice/). The author of this blog has written two functions: one converts an odt/doc/docx file to pdf, and the other reads plain text from the resulting pdf. – snake_charmer Dec 13 '17 at 15:11

1 Answer

If you don't mind discarding the formatting and you simply want to extract the text, you can open the .docx as a zip file and then strip the XML tags using regular expressions:

import re
import zipfile

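def extract_text(filepath):
    """Return the text of a .docx file with all XML tags stripped."""
    # A .docx file is a zip archive; the body text lives in word/document.xml.
    with zipfile.ZipFile(filepath) as docx:
        content = docx.read('word/document.xml').decode('utf-8')
    # Remove every tag, leaving only the character data between them.
    return re.sub(r'<[^>]+>', '', content)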
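
One caveat with the regex approach above: it joins all paragraphs into one run of text with no separators, and XML entities such as `&amp;` are left undecoded. If that matters, a possible variant is to parse the XML with the standard library instead. The following is only a sketch under some assumptions: the function names `extract_paragraphs` and `extract_text_by_paragraph` are made up for illustration, only `w:t` (text) and `w:tab` elements are handled, and paragraphs nested inside other paragraphs (for example in text boxes) would show up more than once.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the main WordprocessingML elements in word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def extract_paragraphs(source):
    """Return the text of each <w:p> paragraph as a list of strings.

    `source` may be a path or a binary file-like object (hypothetical helper).
    """
    with zipfile.ZipFile(source) as docx:
        root = ET.fromstring(docx.read("word/document.xml"))

    paragraphs = []
    for p in root.iter(f"{{{W_NS}}}p"):
        parts = []
        for node in p.iter():
            if node.tag == f"{{{W_NS}}}t":
                # Text run; the parser has already decoded entities like &amp;.
                parts.append(node.text or "")
            elif node.tag == f"{{{W_NS}}}tab":
                parts.append("\t")
        paragraphs.append("".join(parts))
    return paragraphs


def extract_text_by_paragraph(source):
    """Join the paragraphs with newlines into a single string."""
    return "\n".join(extract_paragraphs(source))
```

Calling `extract_text_by_paragraph('files/file.docx')` should then give a single string with one line per paragraph, which is usually easier to work with than one unbroken block of text.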