2

For a DOCX document I do:

import zipfile
from bs4 import BeautifulSoup
document = zipfile.ZipFile(path)
soup = BeautifulSoup(document.read('word/document.xml'), 'html.parser')

How can I do this for a DOC document?

kjhughes
sandboxj
  • The `.doc` file format is *not* XML-based - not clear what you're expecting to get here. – marc_s Dec 01 '19 at 14:31
  • I am trying to get highlighted text from Word documents. I am able to get which `w:r` elements are highlighted, and in what color, from the `xml` of the `.docx`. I want to do the same for `.doc`. Is there a way to get not only the string from the `.doc` but also the markup/structure behind it? – sandboxj Dec 01 '19 at 15:24
  • You are in effect asking for a library to interpret the proprietary .doc format. –  Dec 01 '19 at 21:20

1 Answer

3

You don't.

DOCX files are tough enough to process, even though they're XML-based and standardized (ECMA-376, ISO/IEC 29500). DOC files are binary and proprietary.
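To illustrate the XML side, here is a minimal sketch of reading highlighted runs out of a DOCX, along the lines described in the question's comments. The function name `highlighted_runs` is made up for this example, and it only looks at direct `w:highlight` run properties; highlighting applied through styles or `w:shd` shading is not covered.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used in word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def highlighted_runs(docx_file):
    """Return (color, text) pairs for every run with a direct w:highlight.

    `docx_file` may be a path or a binary file-like object.
    (Hypothetical helper, not part of any library.)
    """
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    results = []
    for run in root.iter(f"{{{W_NS}}}r"):
        highlight = run.find("w:rPr/w:highlight", NS)
        if highlight is None:
            continue  # run is not directly highlighted
        color = highlight.get(f"{{{W_NS}}}val")
        # A run can hold several w:t elements; join their text.
        text = "".join(t.text or "" for t in run.findall("w:t", NS))
        results.append((color, text))
    return results
```

Nothing comparable is possible with `zipfile` on a DOC file, because a DOC is not a ZIP archive of XML parts.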

Don't try to process DOC files directly. Convert them to DOCX first.
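One possible way to script the conversion, assuming LibreOffice is installed and its `soffice` executable is on your PATH (that tool choice is an assumption for this sketch; Word itself or another converter would do as well). The helper names `build_convert_command` and `convert_doc_to_docx` are made up for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(doc_path, out_dir, soffice="soffice"):
    """Build the LibreOffice command line that converts one file to DOCX."""
    return [
        soffice,
        "--headless",            # run without opening a GUI window
        "--convert-to", "docx",  # target format
        "--outdir", str(out_dir),
        str(doc_path),
    ]


def convert_doc_to_docx(doc_path, out_dir):
    """Convert doc_path to DOCX in out_dir; return the expected output path.

    Assumes LibreOffice is installed; raises if 'soffice' cannot be found.
    """
    if shutil.which("soffice") is None:
        raise RuntimeError("LibreOffice ('soffice') was not found on PATH")
    subprocess.run(build_convert_command(doc_path, out_dir), check=True)
    # LibreOffice keeps the base name and swaps the extension.
    return Path(out_dir) / (Path(doc_path).stem + ".docx")
```

The returned path can then be opened with `zipfile.ZipFile` exactly as in the question. Check the converted file before relying on it, since formatting such as highlighting may not survive every conversion unchanged.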

See:

kjhughes