
Possible Duplicate:
extracting text from MS word files in python

I want to parse a .doc file with a Python script in order to search it with an expression. The script runs on a Unix machine.

Can anyone help?

hlx

2 Answers


You can use PyUNO.

Sample:

# HelloWorld python script for the scripting framework

def HelloWorldPython():
    """Prints the string 'Hello World (in Python)' into the current document"""
    # get the doc from the scripting context which is made available to all scripts
    model = XSCRIPTCONTEXT.getDocument()
    # get the XText interface
    text = model.Text
    # create an XTextRange at the end of the document
    tRange = text.End
    # and set the string
    tRange.String = "Hello World (in Python)"
    return None

Other PyUNO samples

Adem Öztaş

You may take a look at the python-docx project. After downloading the library, you can run python example-extracttext.py docfile.docx textfile.txt in the shell and then grep some-expression textfile.txt, since the script writes the extracted text to the output file rather than to standard output. You can also do a more sophisticated search in Python code when necessary.
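For the "search in Python code" route, here is a minimal sketch that does not use python-docx at all. It relies only on the fact that a .docx file is a zip archive whose main text lives in word/document.xml. The function names docx_paragraphs and grep_docx are made up for this example, and it ignores headers, footers and footnotes, so treat it as a starting point rather than a complete extractor.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(path):
    """Return the non-empty paragraphs of a .docx file as plain strings.

    `path` may be a filename or a file-like object (hypothetical helper,
    not part of python-docx).
    """
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> runs.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        if text:
            paragraphs.append(text)
    return paragraphs


def grep_docx(path, pattern):
    """Return the paragraphs that match the regular expression `pattern`."""
    regex = re.compile(pattern)
    return [p for p in docx_paragraphs(path) if regex.search(p)]
```

Calling grep_docx("docfile.docx", r"some-expression") would then return the matching paragraphs as a list. Note that this only helps with .docx files, not the older binary .doc format.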

The shortcoming of python-docx is that it currently supports only MS Word 2007/2008 files. If that is a concern, I recommend antiword, which supports Microsoft Word versions 2, 6, 7, 97, 2000, 2002 and 2003. I have been using it in my vimrc to view MS Word files in Vim. Although it is not a Python script, it can easily be invoked from Python.
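A rough sketch of calling antiword from Python and searching its output could look like this. It assumes the antiword binary is installed and on the PATH, and that its output can be decoded as UTF-8 (which depends on your locale and antiword's character mapping). The helper names doc_to_text and search_lines are made up for this example.

```python
import re
import shutil
import subprocess


def doc_to_text(path):
    """Convert a .doc file to plain text by running `antiword <path>`.

    Assumes antiword is installed; it writes the text to standard output.
    """
    if shutil.which("antiword") is None:
        raise RuntimeError("antiword is not installed or not on PATH")
    result = subprocess.run(["antiword", path], capture_output=True, check=True)
    # The output encoding depends on the environment; UTF-8 is an assumption.
    return result.stdout.decode("utf-8", errors="replace")


def search_lines(text, pattern):
    """Return (line_number, line) pairs for lines matching the regex."""
    regex = re.compile(pattern)
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), 1)
        if regex.search(line)
    ]
```

Usage would be something like search_lines(doc_to_text("report.doc"), r"some-expression"), where report.doc is a placeholder filename. Keeping the search separate from the conversion means the same function works on text from any other extractor.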

Hui Zheng