Using olefile to extract text from Word .doc

Question

I am only concerned with getting the text from .doc files. I am using Python 3.6 on Windows 10, so textract/antiword are off the table. I looked at other references from this question, but they are all old and incompatible with Windows 10 and/or Python 3.6.

My document is a .doc file with a mix of Chinese and English. I am not familiar with how Word stores its files, and I don't have Word on my machine. Using olefile I was able to get the bytes of the document, but I do not know how to traverse the headers and layout correctly to extract the text. If I naively try

from olefile import OleFileIO as ofio
ole = ofio('d.doc')
stream = ole.openstream('WordDocument')
data = stream.read()
data.decode('utf-16')
>>>UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 9884-9885: illegal encoding
data[9884:9885]
>>>b'\xfa'
data[:9884].decode('utf-16')

Then the last line gives me about half the doc, starting and ending with a lot of garbage characters. I suspect I could keep trying this method to get the text piece-by-piece, but I ultimately need to do this for a lot of files. Even if I did it this way, I can't think of a good way to automate it. How can I reliably get the text from a .doc using olefile?

(Feel free to include alternatives to olefile in your answer as well, if you know of one that would work with my specs)

https://wiki.openoffice.org/wiki/PyUNO_samples might help, if you can install LibreOffice — Prof. Falken, Aug 21 '18 at 12:06
https://stackoverflow.com/a/26630266/193892 https://stackoverflow.com/a/30122239/193892 — Prof. Falken, Aug 21 '18 at 14:14
The problem you're encountering is that a *.doc file is in the old, proprietary binary file format, which means the content is encoded. There's no straight-forward way to extract the text from such a file. — Cindy Meister, Aug 21 '18 at 16:46

Prof. Falken · Accepted Answer · 2019-11-05T09:57:59.680

I am not sure, but I think the problem is that olefile has no understanding of Word documents, only of OLE "streams". My guess is that the extracted data contains more than plain text, such as control characters of some kind, and that this is why the data cannot be decoded as UTF-16 in one go.
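
To illustrate that guess, here is a minimal sketch on synthetic bytes, not on a real Word file. It shows UTF-16 text followed by a byte pair that is not valid UTF-16, so decoding stops at that offset, much like the traceback in the question. The helper name `decodable_prefix` is made up for this example.

```python
def decodable_prefix(data: bytes, encoding: str = "utf-16-le"):
    """Return (text, offset): the longest prefix that decodes cleanly
    and the byte offset at which decoding stopped."""
    try:
        return data.decode(encoding), len(data)
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the first byte that could not be decoded
        return data[:exc.start].decode(encoding), exc.start


# Synthetic stand-in for a stream: text, then a non-text byte pair, then more text.
sample = (
    "中文 and English".encode("utf-16-le")
    + b"\x00\xdc"  # a lone low surrogate, which is invalid in UTF-16
    + "more".encode("utf-16-le")
)
text, offset = decodable_prefix(sample)
print(repr(text), offset)
```

Most byte pairs are valid UTF-16 code units, so binary data often decodes without any error and simply shows up as garbage characters. That is probably why slicing the stream at the error position is not a reliable way to find the text.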

There are Python modules to convert from doc files, but they tend to work only on Linux where they make use of the command line utilities antiword or catdoc.

I tried a few other solutions. If the issue is only that you have no license for Word, but you can otherwise install software, LibreOffice could be a path forward. With this command, I converted a Word test file containing Chinese characters from .doc format to HTML:

"c:\Program Files\LibreOffice\program\swriter.exe" --convert-to html d.doc

LibreOffice can also convert to many other formats, but HTML should be simple enough to process further. I also tried a port of catdoc to Windows but I couldn't get it to handle the Chinese letters.
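
As a rough idea of what "process further" could look like, here is a sketch that pulls the visible text out of an HTML string using only the standard library. The names `TextExtractor` and `html_to_text` are made up for this example, and it has not been tuned to the HTML that LibreOffice actually produces.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped element
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    """Return the text nodes of an HTML document, one per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "\n".join(parser.parts)
```

This is crude: every text node ends up on its own line, so inline markup such as bold runs will split a sentence. The encoding of the generated HTML file is not covered here and would need checking before reading it into a string.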

Too bad you don't have Word installed, or you could have made it do the work for you. Leaving that solution here in case someone else has use for it:

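import win32com.client  # part of the pywin32 package

# Start Word through COM; this requires Word to be installed.
app = win32com.client.Dispatch("Word.Application")

try:
    app.Visible = False
    # Open() returns the Document object, so use it directly.
    doc = app.Documents.Open('c:/temp/d.doc')

    with open('out.txt', 'w', encoding='utf-16') as f:
        f.write(doc.Content.Text)

    doc.Close(False)  # close the document without saving changes

except Exception as e:
    print(e)

finally:
    # Always shut Word down, even if opening or writing failed.
    app.Quit()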

Thanks, so far it is working, though it's a bit finicky with handling Chinese, English, and the pictures. Also with LibreOffice 6.1.0 on Windows 10 you need to specify outdir: .\soffice --headless --convert-to html --outdir C:\Users\windows\Desktop C:\Users\windows\Desktop\d.doc — tigerninjaman, Aug 22 '18 at 02:29
@tigerninjaman, nice to hear! Since you are going for this solution, beware that the call to `swriter.exe` (and I guess `soffice.exe` too) is not blocking - the application completes from the DOS window *before* an output file is produced. It continues running in the background until the conversion is complete. Also, if the HTML is finicky to parse, it might be worthwhile looking into all the other output formats LibreOffice can produce, such as doc**x**, RTF, Abiword format and DocBook https://en.wikipedia.org/wiki/DocBook For HTML parsing: https://www.crummy.com/software/BeautifulSoup/ — Prof. Falken, Aug 22 '18 at 12:35
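
Putting the two comments above together, a batch conversion might be sketched as follows. It uses the `--headless`, `--convert-to` and `--outdir` options from the command quoted above, and it waits for the output file to appear because the call may return early. Several things here are assumptions: the `soffice.exe` path (derived from the `swriter.exe` path in the answer), the function names, the timeout values, and that the output file is named after the input with the new extension.

```python
import subprocess
import time
from pathlib import Path

# Assumed install location, based on the path used in the answer; adjust as needed.
SOFFICE = r"c:\Program Files\LibreOffice\program\soffice.exe"


def build_command(doc_path, out_dir, fmt="html", soffice=SOFFICE):
    """Build the soffice argument list for converting one document."""
    return [soffice, "--headless", "--convert-to", fmt,
            "--outdir", str(out_dir), str(doc_path)]


def convert(doc_path, out_dir, fmt="html", timeout=60.0, poll=0.5):
    """Run the conversion and wait until the output file shows up.

    Returns the output path, or None if it did not appear within `timeout` seconds.
    """
    # Assumption: the output is named <input stem>.<fmt> inside out_dir
    expected = Path(out_dir) / (Path(doc_path).stem + "." + fmt)
    subprocess.run(build_command(doc_path, out_dir, fmt), check=True)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if expected.exists():
            return expected
        time.sleep(poll)
    return None


def convert_folder(folder, out_dir, fmt="html"):
    """Convert every .doc file directly inside `folder`."""
    return {doc: convert(doc, out_dir, fmt)
            for doc in sorted(Path(folder).glob("*.doc"))}
```

Two caveats: a file left over from an earlier run makes `convert` return immediately, and a file that exists may still be in the middle of being written. Converting into an empty output directory avoids the first problem.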
