How to get XML from DOC (not DOCX)?

Question

For a DOCX document I do:

import zipfile
from bs4 import BeautifulSoup

document = zipfile.ZipFile(path)
soup = BeautifulSoup(document.read('word/document.xml'), 'html.parser')

How can I do the same for a DOC document?

The `.doc` file format is *not* XML-based - not clear what you're expecting to get here..... — marc_s, Dec 01 '19 at 14:31
I am trying to get highlighted text from Word documents, and I can already tell which `w:r` elements are highlighted and in what color from the `xml` of the `.docx`. I want to do the same for `.doc`. Is there a way to get not only the string from the `.doc` but also the 'markup'/structure behind it? — sandboxj, Dec 01 '19 at 15:24
You are in effect asking for a library to interpret the proprietary .doc format. — , Dec 01 '19 at 21:20

kjhughes · Accepted Answer · 2019-12-01T15:36:42.897

You don't.

DOCX files are tough enough to process, even though they're XML-based and documented by international standards organizations. DOC files are binary and proprietary.

Don't try to process DOC files directly. Convert them to DOCX first.
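
The answer does not name a conversion tool, so the following is only a minimal sketch of one possible route. It assumes LibreOffice is installed and that its `soffice` executable is on the PATH. The helper names (`build_convert_command`, `convert_doc_to_docx`, `highlighted_runs`) are hypothetical, chosen for illustration. After conversion, the highlighted `w:r` runs can be read from `word/document.xml` just as for any other DOCX.

```python
import shutil
import subprocess
import zipfile
import xml.etree.ElementTree as ET
from pathlib import Path

# WordprocessingML namespace used by word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def build_convert_command(doc_path, out_dir, soffice="soffice"):
    """Return the LibreOffice command line that converts doc_path to DOCX."""
    return [soffice, "--headless", "--convert-to", "docx",
            "--outdir", str(out_dir), str(doc_path)]


def convert_doc_to_docx(doc_path, out_dir):
    """Convert a .doc file to .docx; assumes LibreOffice is installed."""
    if shutil.which("soffice") is None:
        raise RuntimeError("LibreOffice (soffice) was not found on PATH")
    subprocess.run(build_convert_command(doc_path, out_dir), check=True)
    # LibreOffice writes <original stem>.docx into out_dir
    return Path(out_dir) / (Path(doc_path).stem + ".docx")


def highlighted_runs(docx_path):
    """Yield (color, text) for each run that carries a w:highlight property."""
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    for run in root.iter(f"{{{W_NS}}}r"):
        highlight = run.find(f"{{{W_NS}}}rPr/{{{W_NS}}}highlight")
        if highlight is None:
            continue  # run is not highlighted
        text = "".join(t.text or "" for t in run.iter(f"{{{W_NS}}}t"))
        yield highlight.get(f"{{{W_NS}}}val"), text
```

A call such as `highlighted_runs(convert_doc_to_docx("example.doc", "converted"))` would then iterate over the highlighted runs of a hypothetical `example.doc`. This sketch only looks at `w:highlight`; background shading applied through other properties is not covered, and how faithfully highlighting survives the conversion depends on the converter.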
