Description

I am using the textract Python library to extract text from Word documents. The problem: if the path contains Arabic characters, antiword reports that it can't read the document.

Example

import textract

# path = 'C:\\test-docs\\info.doc'
path = 'C:\\مجلدات اختبارية\\info.doc'
text = textract.process(path, encoding='UTF-8')

print(text)

Error

Traceback (most recent call last):
  File "c:\test-extract-doc.py", line 5, in <module>
    text = textract.process(path, encoding='UTF-8')
  File "C:\Users\mohja\AppData\Local\Programs\Python\Python39\lib\site-packages\textract\parsers\__init__.py", line 77, in process 
    return parser.process(filename, encoding, **kwargs)
  File "C:\Users\mohja\AppData\Local\Programs\Python\Python39\lib\site-packages\textract\parsers\utils.py", line 46, in process    
    byte_string = self.extract(filename, **kwargs)
  File "C:\Users\mohja\AppData\Local\Programs\Python\Python39\lib\site-packages\textract\parsers\doc_parser.py", line 9, in extract
    stdout, stderr = self.run(['antiword', filename])
  File "C:\Users\mohja\AppData\Local\Programs\Python\Python39\lib\site-packages\textract\parsers\utils.py", line 100, in run       
    raise exceptions.ShellError(
textract.exceptions.ShellError: The command `antiword C:\مجلدات اختبارية\info.doc` failed with exit code 1
------------- stdout -------------
b''------------- stderr -------------
b"I can't find the name of your HOME directory\r\nI can't open 'C:\\?????? ????????\\info.doc' for reading\r\n"

Notes

  • The process works fine with .docx documents.
  • With a directory name that contains no Arabic characters, it also works for .doc documents.
mohjak
  • What happens when you call `open('C:\\مجلدات اختبارية\\info.doc')`? – Tomalak May 10 '21 at 11:49
  • It works fine without any errors and this is the output `<_io.TextIOWrapper name='C:\\مجلدات اختبارية\\info.doc' mode='r' encoding='cp1254'>` – mohjak May 10 '21 at 11:53
  • *"The process is working fine if I use .docx documents."* - after looking at [the source code of `textract`](https://github.com/deanmalmgren/textract/tree/master/textract/parsers) it's clear that that's because textract uses different parsers for .doc and .docx. It uses [`docx2txt`](https://github.com/ankushshah89/python-docx2txt) for .docx, and for .doc it uses [`antiword`](http://www.winfield.demon.nl/). The latter seems to have an internal problem with Unicode paths. You could find out which version of antiword is on your system, and check if a newer version is available. – Tomalak May 10 '21 at 14:25
  • ...but given the age of `antiword` (the latest version appears to be from 2005), you might be out of luck there. One option would be to use a different tool to extract text from `.doc` files and rewire textract's `doc_parser.py` accordingly, or you could look for a different multi-format text extraction utility entirely. [Apache Tika](https://tika.apache.org/) comes to mind. – Tomalak May 10 '21 at 14:29
  • And as yet another alternative, if you want to stick with textract, you could [convert `.doc` paths to their "short filename" variant](https://stackoverflow.com/questions/23598289/how-to-get-windows-short-file-name-in-python) before passing them to `textract.process()`. Short filenames are what you see when you run `dir /x` in the console, and they are guaranteed to be "ANSI-characters only", so they will be usable by `antiword`. All bets are off, though, as to what output `antiword` produces from an Arabic document; chances are that's all question marks again. – Tomalak May 10 '21 at 14:41
  • The short filename works for me. To output the document text, I decoded it with `text.decode(encoding='UTF-8', errors='strict')`. If you don't mind, please consider adding this as an answer. Thank you very much! – mohjak May 11 '21 at 12:17

1 Answer

After digging into the source code of textract, it becomes clear that for extraction from .doc the (ancient) command line tool antiword is used.

class Parser(ShellParser):
    """Extract text from doc files using antiword.
    """

    def extract(self, filename, **kwargs):
        stdout, stderr = self.run(['antiword', filename])
        return stdout

Python does everything properly, but apparently antiword itself has issues with the way it parses its arguments, at least on Windows, so passing a Unicode path results in breakage.
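
To illustrate the kind of breakage, here is a small sketch. It assumes that the path reaches antiword encoded in the system's ANSI code page (cp1254 on the asker's machine, judging by the `open()` output quoted in the comments) and that characters without an equivalent are replaced by `?`. This is an assumption about the mechanism, not something confirmed from antiword's source:

```python
# Assumption: the path reaches antiword encoded in the ANSI code page.
# cp1254 is taken from the open() output quoted in the comments.
path = 'C:\\مجلدات اختبارية\\info.doc'

# Arabic letters have no cp1254 equivalent, so each one is replaced by "?".
mangled = path.encode('cp1254', errors='replace')
print(mangled)
```

The result has the same shape as the path in the stderr output above: the ASCII parts survive, and every Arabic letter turns into a question mark.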

Luckily Windows offers a way of converting any path into a backwards-compatible form of ANSI-only 8.3 filenames - the so-called "short" paths, which can be requested from the system with a Win32 API call. Short paths and regular ("long") paths are interchangeable, but legacy software might like short paths better.

This provides a work-around: Retrieve the short path for any .doc file and give that to antiword instead. Win32 API calls are supplied in Python by the win32api module:

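import textract
from win32api import GetShortPathName  # provided by the pywin32 package

def extract_text(path):
    # antiword is only used for .doc files, so only those need the short path
    if path.lower().endswith(".doc"):
        path = GetShortPathName(path)

    return textract.process(path, encoding='UTF-8')

text = extract_text('C:\\مجلدات اختبارية\\info.doc')
print(text)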
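
If installing pywin32 is not an option, the same Win32 call can be reached through the standard library's ctypes module. The following is a minimal sketch rather than a definitive implementation: `needs_short_path` and `to_short_path` are hypothetical helper names, and restricting the conversion to non-ASCII `.doc` paths is my own choice. It only helps on volumes where Windows generates 8.3 short names.

```python
import ctypes
import os


def needs_short_path(path):
    """True for .doc paths with non-ASCII characters (the antiword case)."""
    return path.lower().endswith(".doc") and not path.isascii()


def to_short_path(path):
    """Return the 8.3 short form of an existing path on Windows.

    On other platforms the path is returned unchanged.
    """
    if os.name != "nt":
        return path

    from ctypes import wintypes

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    get_short = kernel32.GetShortPathNameW
    get_short.argtypes = [wintypes.LPCWSTR, wintypes.LPWSTR, wintypes.DWORD]
    get_short.restype = wintypes.DWORD

    size = 260  # initial guess; grown below if the API asks for more
    while True:
        buffer = ctypes.create_unicode_buffer(size)
        result = get_short(path, buffer, size)
        if result == 0:
            # The call failed, e.g. because the file does not exist.
            raise ctypes.WinError(ctypes.get_last_error())
        if result < size:
            return buffer.value
        size = result  # buffer too small: result is the required size
```

Usage would then look like `textract.process(to_short_path(path) if needs_short_path(path) else path, encoding='UTF-8')`. As noted in the comments, the result is a byte string, so call `.decode('UTF-8')` on it before printing it as text.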
dataninsight
Tomalak