Possible Duplicate:
extracting text from MS word files in python
I want to parse a .doc file with a Python script so that I can search it with an expression. The script runs on a Unix machine.
Can anyone help?
You can use PyUNO.
Sample:
# HelloWorld python script for the scripting framework
def HelloWorldPython():
    """Prints the string 'Hello World (in Python)' into the current document"""
    # get the doc from the scripting context which is made available to all scripts
    model = XSCRIPTCONTEXT.getDocument()
    # get the XText interface
    text = model.Text
    # create an XTextRange at the end of the document
    tRange = text.End
    # and set the string
    tRange.String = "Hello World (in Python)"
    return None
Other PyUNO samples
You may take a look at this project: python-docx.
After downloading the library, you can run the following in the shell:
python example-extracttext.py docfile.docx textfile.txt | grep some-expression
You can also do a more sophisticated search in Python code when necessary.
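For a search done in Python code rather than with grep, here is a minimal sketch. It does not use python-docx at all: it assumes the file is a .docx (a zip archive whose main text lives in word/document.xml) and relies only on the standard library. The helper names docx_paragraphs and search_docx are made up for illustration.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(path):
    """Return the text of each paragraph in a .docx file (hypothetical helper)."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # One paragraph is made of several text runs (<w:t>); join them.
        paragraphs.append("".join(t.text or "" for t in para.iter(W_NS + "t")))
    return paragraphs


def search_docx(path, pattern):
    """Return the paragraphs matching a regular expression (hypothetical helper)."""
    regex = re.compile(pattern)
    return [p for p in docx_paragraphs(path) if regex.search(p)]
```

Calling search_docx("docfile.docx", r"some-expression") would then return the matching paragraphs as a list of strings. This only reads the main document body, so headers, footers and footnotes are not searched.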
The shortcoming of python-docx is that it currently supports only MS Word 2007/2008 files. If that concerns you, I recommend antiword, which supports Microsoft Word versions 2, 6, 7, 97, 2000, 2002 and 2003. I have been using it in my vimrc to view MS Word files in Vim. Although it is not a Python script, it can easily be invoked from Python.
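Invoking antiword from Python could look roughly like the sketch below. It assumes antiword is installed and on the PATH, and that it prints the document text to standard output when given a file name. The function names doc_to_text and grep_text, and the command parameter, are hypothetical and exist only for this example.

```python
import re
import subprocess


def doc_to_text(path, command=("antiword",)):
    """Run an external converter on `path` and return what it prints.

    `command` defaults to antiword (assumed to be installed); it is a
    parameter only so that another converter can be swapped in.
    """
    result = subprocess.run(
        [*command, path],
        capture_output=True,  # collect stdout instead of printing it
        text=True,            # decode output using the locale encoding
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return result.stdout


def grep_text(text, pattern):
    """Return the lines of `text` that match a regular expression."""
    regex = re.compile(pattern)
    return [line for line in text.splitlines() if regex.search(line)]
```

A search over a .doc file would then be grep_text(doc_to_text("docfile.doc"), r"some-expression"), where the file name and expression are placeholders.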