Python textract ImportError

Question

I have begun using the Python library textract to parse text from PowerPoint presentations (.pptx), Word documents (.docx), and text files (.txt). I wrote a simple script to test it.

# Python textract test script
import textract
# Raw string, so the backslashes in the Windows path are not read as escape sequences
textract.process(r"H:\My Documents\Test.docx")

When I run it, either on the command line or in IDLE, I get a traceback whose last few lines are:

File "C:...\textract\parsers\docx_parser.py", line 1, in <module>
    import docx2txt
ImportError: No module named docx2txt

I am using version 1.5.0, downloaded from https://pypi.python.org/pypi/textract. I don't know why it would not include any dependencies. Will I have to install docx2txt and its subsequent dependencies? Why would the textract package not contain everything I need?

Have you tried installing docx2txt? – Quartal, May 31 '17 at 22:13

score 1 · Answer 1 · answered May 13 '19 at 17:37

This worked for me. Open a terminal and run the following commands:

python -m venv env 
source ./env/bin/activate
sudo apt update
sudo apt install python-pip && pip install --upgrade pip
sudo apt install python-dev libxml2-dev libxslt1-dev antiword unrtf poppler-utils pstotext tesseract-ocr flac ffmpeg lame libmad0 libsox-fmt-mp3 sox libjpeg-dev swig
pip install textract

If you run into any errors, try the following:

pip install https://pypi.python.org/packages/ce/c7/ab6cd0d00ddf8dc3b537cfb922f3f049f8018f38c88d71fd164f3acb8416/SpeechRecognition-3.6.3-py2.py3-none-any.whl
sudo apt install libpulse-dev
pip install textract

score 0 · Answer 2 · answered May 31 '17 at 22:22

I would recommend using pip install xxx (where xxx is the package name) to install the module. That installs it in a location that Python searches by default, and it should also take care of dependencies.

If you did a manual installation or just extracted it to some folder, then set your path correctly, as described in "How to add to the pythonpath in windows 7?" or "Python - PYTHONPATH in linux".

If you think you've set it correctly, then post its value, your working directory (pwd), etc.
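For reference, here is a minimal sketch of how to collect that information from inside Python so it can be pasted into the question. The helper name describe_environment is made up for illustration; it uses only the standard library, and docx2txt is simply the module named in the traceback.

```python
import importlib.util
import os
import sys


def describe_environment(module_name):
    """Collect the facts that usually explain an ImportError."""
    # find_spec returns None when a top-level module cannot be found.
    spec = importlib.util.find_spec(module_name)
    return {
        "interpreter": sys.executable,
        "working_directory": os.getcwd(),
        "search_path": list(sys.path),
        # Where the module lives if Python can find it, otherwise None.
        "module_location": spec.origin if spec is not None else None,
    }


if __name__ == "__main__":
    for key, value in describe_environment("docx2txt").items():
        print(f"{key}: {value}")
```

If module_location comes back as None, the interpreter shown is not the one the package was installed into, or the package was never installed at all.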

score 0 · Answer 3 · answered May 31 '17 at 23:07

textract does not automatically install the dependencies for all the file types it supports. You selectively install the ones you're interested in.

While this is not as elegant as one might imagine it could be, it's the appropriate design choice here I think. Python doesn't have the ability to install dependencies on-demand, so the only alternative would be for textract to install all the dozen or more possible dependencies, which would tend to bloat your Python environment.
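To illustrate the trade-off, here is a minimal sketch of the general lazy-import pattern, where the module for a file type is imported only when that file type is actually requested. This is not textract's own code: the REQUIRED_MODULE table and the load_backend helper are hypothetical, and only the docx2txt name comes from the traceback in the question.

```python
import importlib
import os

# Hypothetical table: file extension -> module that a parser for it would need.
REQUIRED_MODULE = {".docx": "docx2txt", ".txt": None}


def load_backend(path, required=None):
    """Import the module needed for this file type only when it is requested."""
    table = REQUIRED_MODULE if required is None else required
    extension = os.path.splitext(path)[1].lower()
    if extension not in table:
        raise ValueError(f"unsupported file type: {extension!r}")
    module_name = table[extension]
    if module_name is None:
        return None  # this file type needs no extra module
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Re-raise with a hint, keeping the original error as the cause.
        raise ImportError(
            f"{extension} files need the '{module_name}' module; "
            "install the package that provides it"
        ) from exc
```

With this kind of design the import succeeds for everyone, and the error only appears for the file types whose extra module is missing, which matches what the traceback shows.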

So in this case, as Kashyap mentions, the appropriate action is:

pip install docx2txt

and similar for any other file type dependencies you might need.
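A quick way to see what is missing before running a script is to check whether each module can be found. The sketch below is only an illustration: missing_packages is a made-up helper, and the module-to-package table is an assumption. docx2txt is the module from the traceback; the pptx entry (provided by the python-pptx package) is assumed to be what the PowerPoint parser needs, so verify it against the textract documentation.

```python
import importlib.util

# Assumed mapping: importable module name -> pip package that provides it.
PARSER_MODULES = {
    "docx2txt": "docx2txt",   # module named in the traceback
    "pptx": "python-pptx",    # assumption for .pptx files
}


def missing_packages(modules=None):
    """Return the pip package names whose modules cannot be found."""
    table = PARSER_MODULES if modules is None else modules
    return sorted(
        package
        for module, package in table.items()
        if importlib.util.find_spec(module) is None
    )


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Install with: pip install " + " ".join(missing))
    else:
        print("All listed parser modules are importable.")
```

Run it with the same interpreter that runs the textract script, otherwise the result describes a different environment.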
