How can you get the text from a docx file in Python? Ideally, the text would be read into a simple string. Formatting in the original file can be ignored.
I understand the structure of a docx file (a folder in which the text is saved as document.xml), but I would like a simple way of extracting the text, without having to manually open that folder, extract the file and extract paragraph tags.
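For reference, the manual route described above can be sketched with only the standard library. This is a minimal sketch, not a definitive implementation: it assumes the file is an ordinary zip-packaged docx whose main body is stored at word/document.xml, and the function name docx_to_text is made up for illustration. It joins the text runs of each paragraph and ignores tabs, line breaks, headers, footers and footnotes.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path):
    """Return the body text of a .docx file, one line per paragraph.

    `path` may be a filename or a binary file-like object.
    (Hypothetical helper, named here only for illustration.)
    """
    # A .docx file is a zip archive; the main body lives in word/document.xml
    with zipfile.ZipFile(path) as archive:
        xml_bytes = archive.read("word/document.xml")

    root = ET.fromstring(xml_bytes)
    paragraphs = []
    # <w:p> is a paragraph; <w:t> elements hold the actual text of each run
    for para in root.iter(W_NS + "p"):
        runs = [node.text for node in para.iter(W_NS + "t") if node.text]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

Calling docx_to_text('files/file.docx') would then return the whole document as a single string, but this still amounts to parsing the XML by hand, which is what I was hoping a library would do for me.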
I have tried Python Docx (as per this old Stack Overflow question), but I get an error every time:
import docx as dx
document = dx.opendocx('files/file.docx')
Traceback (most recent call last):
  File "concord.py", line 2, in <module>
    document = dx.opendocx('files/#n01 ch B3A126.docx')
AttributeError: 'module' object has no attribute 'opendocx'