Description
I am using the textract Python library to extract text from Word documents. The problem: if the path contains Arabic characters, antiword reports that it can't open the document.
Example
import textract
# path = 'C:\\test-docs\\info.doc'
path = 'C:\\مجلدات اختبارية\\info.doc'
text = textract.process(path, encoding='UTF-8')
print(text)
Error
Traceback (most recent call last):
  File "c:\test-extract-doc.py", line 5, in <module>
    text = textract.process(path, encoding='UTF-8')
  File "C:\Users\mohja\AppData\Local\Programs\Python\Python39\lib\site-packages\textract\parsers\__init__.py", line 77, in process
    return parser.process(filename, encoding, **kwargs)
  File "C:\Users\mohja\AppData\Local\Programs\Python\Python39\lib\site-packages\textract\parsers\utils.py", line 46, in process
    byte_string = self.extract(filename, **kwargs)
  File "C:\Users\mohja\AppData\Local\Programs\Python\Python39\lib\site-packages\textract\parsers\doc_parser.py", line 9, in extract
    stdout, stderr = self.run(['antiword', filename])
  File "C:\Users\mohja\AppData\Local\Programs\Python\Python39\lib\site-packages\textract\parsers\utils.py", line 100, in run
    raise exceptions.ShellError(
textract.exceptions.ShellError: The command `antiword C:\مجلدات اختبارية\info.doc` failed with exit code 1
------------- stdout -------------
b''------------- stderr -------------
b"I can't find the name of your HOME directory\r\nI can't open 'C:\\?????? ????????\\info.doc' for reading\r\n"
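The question marks in the stderr output suggest that the path is converted to a legacy (ANSI) code page somewhere between Python and antiword, and that the Arabic letters are lost in that conversion. The snippet below is only an illustration of that kind of lossy conversion. It assumes the code page is cp1252, which is a guess; any code page without Arabic letters would give the same result.

```python
# Assumption: the active ANSI code page is cp1252 (a guess, not confirmed
# by the traceback). Any code page without Arabic letters behaves the same.
path = 'C:\\مجلدات اختبارية\\info.doc'

# Characters that cannot be represented are replaced with '?'.
lossy = path.encode('cp1252', errors='replace').decode('cp1252')
print(lossy)  # C:\?????? ????????\info.doc
```

The result matches the path that antiword says it cannot open, which is consistent with (but does not prove) this explanation.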
Notes
- Extraction works fine if I use .docx documents.
- If I use a directory name without Arabic characters, it also works for .doc documents.
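Since .doc files in ASCII-only directories work, one possible workaround is to copy the file to a temporary ASCII-only path before handing it to textract. The following is a minimal sketch, not a fix in textract itself. The helper name `process_via_ascii_copy` and its `extract` parameter are hypothetical and introduced only for illustration. It also assumes the system temp directory has an ASCII-only path.

```python
import os
import shutil
import tempfile


def process_via_ascii_copy(path, extract=None, **kwargs):
    """Copy `path` to an ASCII-only temp file and run `extract` on the copy.

    Hypothetical helper: `extract` defaults to textract.process.
    """
    if extract is None:
        import textract  # imported lazily; only needed for the default
        extract = textract.process

    # Keep the original extension so the right parser is selected.
    suffix = os.path.splitext(path)[1]
    # Assumption: the temp directory path itself contains only ASCII.
    tmp_dir = tempfile.mkdtemp(prefix='textract_')
    try:
        tmp_path = os.path.join(tmp_dir, 'input' + suffix)
        shutil.copyfile(path, tmp_path)
        return extract(tmp_path, **kwargs)
    finally:
        # Remove the temporary copy whether extraction succeeded or not.
        shutil.rmtree(tmp_dir, ignore_errors=True)
```

With this helper, the call in the example would become `text = process_via_ascii_copy(path, encoding='UTF-8')`. If the temp directory itself contains non-ASCII characters (for example, in the user name), `tempfile.mkdtemp` accepts a `dir` argument that can point to a different location.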