12

I receive Word documents whose formatting corresponds to the data in them. For example, all headers have exactly the same formatting (Times New Roman, 14 pt, bold).

What is the best way to process such MS Word documents (.doc or .docx) into XML documents? Language is not an issue (I'll use Lisp/Boost.Spirit if I have to!).

Mikhail
  • Could you elaborate on how the xml elements would be generated from the word docs? If it's purely text based I would look at converting them to plain text first. – Peter Gibson Nov 24 '10 at 02:20
  • See http://bytes.com/topic/python/answers/24103-parsing-ms-word-document – Rafe Kettler Nov 24 '10 at 02:35
  • See http://stackoverflow.com/questions/125222/extracting-text-from-ms-word-files-in-python – fmark Nov 24 '10 at 02:43
  • Also read this very insightful article by Joel: [Why are the MS Office file formats so complicated? (And some workarounds)](http://www.joelonsoftware.com/items/2008/02/19.html) – Tim Pietzcker Nov 24 '10 at 07:39

5 Answers

10

Take a look at the python-docx library.
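As a rough sketch of what that could look like, assuming python-docx is installed and that headers are marked by direct run formatting (Times New Roman, 14 pt, bold) rather than by a named style. The helper names `is_header`, `paragraphs_to_xml` and `docx_to_xml_string`, and the `<document>`/`<section>`/`<para>` output vocabulary, are made up for illustration. Note that python-docx reads .docx only, not the binary .doc format, and that `run.font.name` and `run.font.size` are `None` when the value is inherited from a style.

```python
import xml.etree.ElementTree as ET


def is_header(paragraph, font="Times New Roman", size_pt=14.0):
    """True if every non-empty run is bold, set in `font`, at `size_pt` points."""
    runs = [r for r in paragraph.runs if r.text.strip()]
    if not runs:
        return False
    for run in runs:
        size = run.font.size  # a Length object, or None when inherited
        if not (run.bold and run.font.name == font
                and size is not None and size.pt == size_pt):
            return False
    return True


def paragraphs_to_xml(paragraphs):
    """Group body paragraphs under the most recent header paragraph."""
    root = ET.Element("document")
    section = None
    for p in paragraphs:
        text = p.text.strip()
        if not text:
            continue
        if is_header(p):
            section = ET.SubElement(root, "section", title=text)
        else:
            parent = section if section is not None else root
            ET.SubElement(parent, "para").text = text
    return root


def docx_to_xml_string(path):
    # Imported here so the helpers above work on any objects that expose
    # .text and .runs the way python-docx paragraphs do.
    from docx import Document
    root = paragraphs_to_xml(Document(path).paragraphs)
    return ET.tostring(root, encoding="unicode")
```

Calling `docx_to_xml_string("report.docx")` (a hypothetical file name) would return the XML as a string. If the documents use real Word styles, checking `paragraph.style.name` is likely more robust than comparing fonts.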

Etienne
  • 12,440
  • 5
  • 44
  • 50
3

So I think you're saying that the structure of the document is encoded in the formatting, and you want to produce XML files that capture that structure, whilst keeping the content in plain text?

If that is so you will need to parse the documents, and build a data structure that can be processed, then dumped out as XML.

For parsing, there are a few options. Microsoft has published the specification for its binary .doc format, which you would need to read in order to write a parser for it. With .docx you're a little luckier: the content is already XML (packaged in a ZIP archive), so you can use any XML parsing library to read it and then search the resulting tree for the data you are interested in. XML parsers are available for pretty much any language; an easy-to-use one that comes to mind is minidom for Python.

For generating your output XML, an object-to-XML library again seems to be the way to go; minidom, for example, does that too.
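The .docx route might be sketched as follows, using only the Python standard library. This swaps in `xml.etree.ElementTree` for minidom because its namespace handling is more compact; that is a choice made here, not something from the answer. It assumes the headers carry direct formatting on each run (`w:b`, `w:sz` of 28 half-points for 14 pt, `w:rFonts` with `w:ascii="Times New Roman"`); formatting inherited from styles would not be detected. The function names and the `<header>`/`<para>` output elements are hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def toggle_on(rpr, tag):
    """True if a toggle property such as w:b is present and not set to off."""
    el = rpr.find(f"w:{tag}", NS)
    return el is not None and el.get(f"{{{W}}}val", "true") not in ("0", "false")


def run_is_header(run, font="Times New Roman", half_points="28"):
    rpr = run.find("w:rPr", NS)
    if rpr is None:
        return False
    fonts = rpr.find("w:rFonts", NS)
    size = rpr.find("w:sz", NS)  # w:sz is measured in half-points
    return (toggle_on(rpr, "b")
            and fonts is not None and fonts.get(f"{{{W}}}ascii") == font
            and size is not None and size.get(f"{{{W}}}val") == half_points)


def run_text(run):
    return "".join(t.text or "" for t in run.findall("w:t", NS))


def convert(document_xml):
    """Map the contents of word/document.xml to a flat <document> tree."""
    src = ET.fromstring(document_xml)
    out = ET.Element("document")
    for p in src.iter(f"{{{W}}}p"):
        # Only runs that are direct children of the paragraph are considered.
        runs = [r for r in p.findall("w:r", NS) if run_text(r).strip()]
        if not runs:
            continue
        tag = "header" if all(run_is_header(r) for r in runs) else "para"
        ET.SubElement(out, tag).text = "".join(run_text(r) for r in runs)
    return out


def docx_to_xml(path_or_file):
    # A .docx file is a ZIP archive; the main text lives in word/document.xml.
    with zipfile.ZipFile(path_or_file) as z:
        return ET.tostring(convert(z.read("word/document.xml")), encoding="unicode")
```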

If you don't want to write your own .doc parser, you could first run the documents through a converter that produces a more accessible format: use Word itself to convert the .doc files to .docx, use a tool that produces RDFs from .docs, or use an existing Word parser such as the one in OpenOffice.
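One way to script the conversion step, as a sketch only: it assumes a LibreOffice installation whose `soffice` binary is on the PATH and supports the `--headless`, `--convert-to` and `--outdir` options (older OpenOffice.org releases may not). The function names are hypothetical.

```python
import subprocess
from pathlib import Path


def build_command(doc_path, out_dir, binary="soffice"):
    """Command line asking the office suite to convert one file to .docx."""
    return [binary, "--headless", "--convert-to", "docx",
            "--outdir", str(out_dir), str(doc_path)]


def convert_doc(doc_path, out_dir, binary="soffice"):
    """Run the conversion and return the path where the .docx is expected."""
    subprocess.run(build_command(doc_path, out_dir, binary), check=True)
    return Path(out_dir) / (Path(doc_path).stem + ".docx")
```

The resulting .docx can then be handled with whichever XML-based approach you prefer.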

David Claridge
  • There is a good Python module to remotely control OpenOffice to convert document formats, including converting .doc to .odt, which contains a zipped XML file that's easy to parse. http://www.artofsolving.com/opensource/pyodconverter – jsbueno Nov 24 '10 at 03:00
  • 4
    BTW, trying to write a .doc parser from MS specs would be suicidal at best - there was a reason why the initial .docx specs had 6000 pages - either use a module that does that already, or convert the .doc to another thing with OOo. – jsbueno Nov 24 '10 at 03:02
2

I used a very inefficient conditional search in VBA to literally copy the document into a second document. The second document was then saved with a .xml extension. It got the job done, but it's ugly.

Mikhail
1

You can also try the Java-based Apache POI - HWPF. It supports text extraction. You will then have to create your own XML document; Castor XML or XStream can help you with that.

n002213f
0

It really depends on exactly what you are trying to do.

The simplest approach would be to save the document as Flat OPC XML (in Word, "Save as.." XML), and then apply an XSLT.

This approach is simplest, since it gives you the entire docx as a single XML file, so you don't have to unzip it etc.
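To illustrate why the single-file form is convenient, here is a small sketch that reads a Flat OPC file with Python's standard library. It uses `xml.etree.ElementTree` instead of XSLT, since the standard library has no XSLT processor; an XSLT stylesheet would navigate the same structure. The sketch assumes the usual Flat OPC layout, a `pkg:package` root containing `pkg:part` elements, with the main text in the part named `/word/document.xml`. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

PKG = "http://schemas.microsoft.com/office/2006/xmlPackage"
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def main_document(flat_opc_xml):
    """Return the w:document element stored inside a Flat OPC package."""
    root = ET.fromstring(flat_opc_xml)
    for part in root.iter(f"{{{PKG}}}part"):
        if part.get(f"{{{PKG}}}name") == "/word/document.xml":
            return part.find(f"{{{PKG}}}xmlData/{{{W}}}document")
    return None


def paragraph_texts(flat_opc_xml):
    """Plain text of each paragraph, in document order."""
    doc = main_document(flat_opc_xml)
    if doc is None:
        return []
    return ["".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
            for p in doc.iter(f"{{{W}}}p")]
```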

If your requirements are more complex, for example, analyzing the formatting or styles, or doing something with hyperlinks, then an object model such as docx4j (Java) or Open XML SDK (C#) - and no doubt there are others - may help.

JasonPlutext