I have begun using the Python library textract to parse text from PowerPoint files (.pptx), Word documents (.docx), and text files (.txt). I wrote a simple script to test it.
# Python textract test script
import textract
# Raw string so the backslashes in the Windows path are taken literally
text = textract.process(r"H:\My Documents\Test.docx")
print(text)
When I run it, either from the command line or in IDLE, I get a traceback whose last few lines are:
  File "C:...\textract\parsers\docx_parser.py", line 1, in <module>
    import docx2txt
ImportError: No module named docx2txt
I am using version 1.5.0, downloaded from https://pypi.python.org/pypi/textract. I don't understand why it did not bring its dependencies with it. Will I have to install docx2txt and its own dependencies by hand? Why would the textract package not contain everything I need?
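To confirm that the module really is missing from the interpreter that runs the script, and not just hidden by some path problem, a check along these lines might help. It is only a minimal sketch: it assumes Python 3.4 or later (for importlib.util.find_spec), and the helper name missing_modules is my own, not part of textract.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that this interpreter cannot find.

    Hypothetical helper for illustration; not part of textract.
    """
    # find_spec returns None when a top-level module is not importable
    return [name for name in names if importlib.util.find_spec(name) is None]


# Lists whichever of the two packages are not installed for this interpreter
print(missing_modules(["textract", "docx2txt"]))
```

If docx2txt shows up in that list while textract does not, the package itself was installed but its dependency was not, which would match the traceback above.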