Note: This was flagged as a potential duplicate of this, but the point of my question is that using textract doesn't work. I am looking either for (a) a way to get textract to work on windows 10 or (b) an alternate solution.
I am building a system that needs to read various types of files. I have set up pdfminer to read the .pdfs, and based on the process outlined here I installed textract, and I can now also read .docx files. However textract relies on antiword for reading .doc files and I cannot get this to work, even after following the directions here I could not find and install a working version of antiword. I do not have microsoft word installed on my machine, and I am running windows 10 with python 3.6.5. Is there any other way to read .doc files?
Here is the bug when running textract.process('d.doc') (ignore the first error, the file is definitely there):
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\lib\site-packages\textract\parsers\utils.py", line 84, in run
stdout=subprocess.PIPE, stderr=subprocess.PIPE,
File "C:\ProgramData\Anaconda3\lib\subprocess.py", line 709, in __init__
restore_signals, start_new_session)
File "C:\ProgramData\Anaconda3\lib\subprocess.py", line 997, in _execute_child
startupinfo)
FileNotFoundError: [WinError 2] The system cannot find the file specified
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\ProgramData\Anaconda3\lib\site-packages\textract\parsers\__init__.py", line 77, in process
return parser.process(filename, encoding, **kwargs)
File "C:\ProgramData\Anaconda3\lib\site-packages\textract\parsers\utils.py", line 46, in process
byte_string = self.extract(filename, **kwargs)
File "C:\ProgramData\Anaconda3\lib\site-packages\textract\parsers\doc_parser.py", line 9, in extract
stdout, stderr = self.run(['antiword', filename])
File "C:\ProgramData\Anaconda3\lib\site-packages\textract\parsers\utils.py", line 91, in run
' '.join(args), 127, '', '',
textract.exceptions.ShellError: The command antiword d.doc failed with exit code 127