Python open .doc file

Question

I'm working on a project in which I need to read the text from multiple doc and docx files. The docx files were easy to handle with the docx2txt module, but I cannot for the life of me make it work for doc files. I've tried textract, but it doesn't seem to work on Windows. I just need the text in the file, no pictures or anything like that. Any ideas?

Does this answer your question? [Read .doc file with python](https://stackoverflow.com/questions/36001482/read-doc-file-with-python) — user202729, Jun 09 '20 at 12:10
It is not easy to do. `textract` can do it if you have antiword installed. Tika can extract the text, but not the formatting. — erip, Jun 09 '20 at 12:10

score 0 · Answer 1 · answered Jun 09 '20 at 12:17

I found that this seems to work:

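import os
import win32com.client

word = win32com.client.Dispatch("Word.Application")
word.Visible = False
# Word resolves relative paths against its own working directory, so pass an absolute path.
doc = word.Documents.Open(os.path.abspath("myfile.doc"))
print(doc.Range().Text)
doc.Close(False)  # close without saving changes
word.Quit()       # otherwise a hidden WINWORD.EXE process keeps running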
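
Since the question is about multiple files, here is a rough sketch of the same approach wrapped in a loop that reuses one Word instance. The names `read_doc_files` and `clean_word_text` are made up for illustration and are not part of pywin32. It assumes Windows with Word and pywin32 installed, and I have only reasoned through it, so treat it as a starting point.

```python
from pathlib import Path


def clean_word_text(raw: str) -> str:
    # Word's Range.Text marks paragraphs with "\r" and table cells with "\x07";
    # turn the former into newlines and drop the latter.
    return raw.replace("\x07", "").replace("\r", "\n")


def read_doc_files(paths):
    # Hypothetical helper: returns {path: text} for each .doc path.
    # Imported lazily because it needs Windows, pywin32 and an installed Word.
    import win32com.client

    word = win32com.client.Dispatch("Word.Application")
    word.Visible = False
    texts = {}
    try:
        for path in paths:
            # Positional arguments: FileName, ConfirmConversions=False, ReadOnly=True
            doc = word.Documents.Open(str(Path(path).resolve()), False, True)
            try:
                texts[str(path)] = clean_word_text(doc.Range().Text)
            finally:
                doc.Close(False)  # do not save changes
    finally:
        word.Quit()  # always shut down the hidden Word process
    return texts
```

Something like `read_doc_files(Path("docs").glob("*.doc"))` would then give one dictionary entry per file, where `docs` is a placeholder folder name.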

Kehinde

This only works on Windows and with the Office suite installed. – erip Jun 09 '20 at 14:08
He mentioned that he is running his code on Windows. Thanks @erip – Kehinde Jun 09 '20 at 17:46

score 0 · Answer 2 · answered Dec 08 '22 at 16:20

I had a similar issue, and the following function worked for me.

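from pathlib import Path

# NOTE: `special_chars` is not shown in the original answer. The mapping below is
# an ASSUMED placeholder (paragraph mark -> newline, table cell mark -> "||||"),
# keyed by str(byte) to match the lookups in get_string.
special_chars = {"b'\\r'": '\n', "b'\\x07'": '||||'}

def get_string(path: Path) -> str:
    string = ''
    with open(path, 'rb') as stream:
        # The text started at byte offset 2560 in the tested file; other files may differ.
        stream.seek(2560)
        current_stream = stream.read(1)

        # Stop at the first NUL byte, or at end of file (read() then returns b'').
        while current_stream and current_stream != b'\x00':
            if str(current_stream) in special_chars:
                string += special_chars[str(current_stream)]
            else:
                try:
                    char = current_stream.decode('UTF-8')
                    if char.isalnum() or char == ' ':
                        string += char
                except UnicodeDecodeError:
                    pass  # skip bytes that do not decode as a single UTF-8 character
            current_stream = stream.read(1)
    return string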

I tested it on a .doc file that looks like this: (picture of .doc file)

The output from:

string = get_string(filepath)
print(string)

is:

The big red fox jumped over the small barrier to get to the chickens on the other side



And the chickens ran about but had no luck in surviving the day
this||||that||||The other||||
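
Because the function above reads raw bytes from a fixed offset, it only makes sense for legacy binary .doc files. A minimal sketch of a sanity check before calling it: legacy .doc files are stored in the OLE2 compound file container, which starts with a fixed 8-byte signature. The name `looks_like_legacy_doc` is made up for illustration. Note that old .xls and .ppt files share the same signature, so this only rules out files that are clearly something else (for example a .docx renamed to .doc).

```python
# Signature found at the start of every OLE2 compound file,
# the container format used by legacy .doc files.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def looks_like_legacy_doc(path) -> bool:
    # Hypothetical helper: True if the file starts with the OLE2 signature.
    with open(path, "rb") as stream:
        return stream.read(len(OLE2_MAGIC)) == OLE2_MAGIC
```

A file that fails this check is better handed to a different reader than to `get_string`.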
