Parsing a .doc (Word) file with a python script (Unix)

Question

Possible Duplicate:
extracting text from MS word files in python

I want to parse a .doc file with a Python script in order to search it with an expression. The script runs on a Unix machine.

Can anyone help?

Have you tried a simple search? – mmmmmm Jan 29 '13 at 14:03

score 4 · Answer 1 · answered Jan 29 '13 at 14:04

You can use PyUNO.

Sample:

# HelloWorld python script for the scripting framework

def HelloWorldPython():
    """Prints the string 'Hello World (in Python)' into the current document"""
    # get the doc from the scripting context, which is made available to all scripts
    model = XSCRIPTCONTEXT.getDocument()
    # get the XText interface
    text = model.Text
    # create an XTextRange at the end of the document
    tRange = text.End
    # and set the string
    tRange.String = "Hello World (in Python)"
    return None
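
The sample above writes to the document, whereas the question asks about searching one. As a rough sketch only, the same scripting context could be used to read the body text and run a regular expression over it. It assumes the .doc file is already open in OpenOffice/LibreOffice Writer; `find_matches` and `FindInCurrentDocument` are hypothetical names, and `some-expression` is a placeholder pattern.

```python
import re

def find_matches(model, expression):
    """Return every regex match found in the body text of a Writer document."""
    # model.Text is the document's XText; its String property holds the body text
    body = model.Text.String
    return re.findall(expression, body)

def FindInCurrentDocument():
    """Macro entry point; only works inside the office scripting framework."""
    # XSCRIPTCONTEXT is injected by the scripting framework, not importable
    model = XSCRIPTCONTEXT.getDocument()
    return find_matches(model, r"some-expression")
```

Keeping the regex logic in `find_matches` means it can be tried outside the office process with any object that exposes `Text.String`.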

Other PyUNO samples

Adem Öztaş

From what I've read, you need OpenOffice installed? – hlx Jan 29 '13 at 14:08
Yes, you need to install OpenOffice or LibreOffice. – Adem Öztaş Jan 29 '13 at 14:22
Can't do that. I did not mention it, but the file is on the network. – hlx Jan 29 '13 at 14:29
Actually you need the lib files, like this: LD_LIBRARY_PATH=/usr/lib64/libreoffice/program: – Adem Öztaş Jan 29 '13 at 14:31

Hui Zheng · Accepted Answer · 2013-01-29T15:02:37.673

3

You may take a look at this project: python-docx. After downloading the library, you can run python example-extracttext.py docfile.docx textfile.txt | grep some-expression in the shell. You can also do a more sophisticated search in Python code when necessary.
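
For a search done in Python code rather than with grep, here is a minimal sketch. It does not use python-docx's own API; instead it relies only on the standard library and on the fact that a .docx file is a zip archive whose body lives in word/document.xml. The helper names `docx_paragraphs` and `grep_docx` are made up for illustration, and this does not work on old binary .doc files.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def docx_paragraphs(path):
    """Return the text of each non-empty paragraph in a .docx file."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # a paragraph's text is split over one or more <w:t> runs
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        if text:
            paragraphs.append(text)
    return paragraphs

def grep_docx(path, expression):
    """Return the paragraphs that match the regular expression."""
    pattern = re.compile(expression)
    return [p for p in docx_paragraphs(path) if pattern.search(p)]
```

Something like `grep_docx("docfile.docx", r"some-expression")` would then return the matching paragraphs as a list of strings.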

The shortcoming of python-docx is that it currently only supports MS Word 2007/2008 (.docx) files. If that concerns you, I recommend antiword, which supports Microsoft Word versions 2, 6, 7, 97, 2000, 2002 and 2003. I've actually been using it in my vimrc to view MS Word files in Vim. Although it's not a Python script, it can easily be invoked from Python.
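
Invoking antiword from Python could look roughly like the sketch below. It assumes antiword is installed and on the PATH, that its output decodes as UTF-8 (undecodable bytes are replaced), and Python 3.7 or later. The function names `extract_text`, `search_text` and `search_doc` are hypothetical.

```python
import re
import subprocess

def extract_text(doc_path, command="antiword"):
    """Run an external converter on doc_path and return its stdout as text."""
    # check=True raises CalledProcessError if the converter exits with an error
    result = subprocess.run([command, doc_path], capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")

def search_text(text, expression):
    """Return (line_number, line) pairs for lines matching the regex."""
    pattern = re.compile(expression)
    return [(number, line)
            for number, line in enumerate(text.splitlines(), 1)
            if pattern.search(line)]

def search_doc(doc_path, expression, command="antiword"):
    """Extract the text of a .doc file and search it line by line."""
    return search_text(extract_text(doc_path, command), expression)
```

For example, `search_doc("report.doc", r"some-expression")` would return the matching lines with their line numbers in antiword's text output, where `report.doc` is a placeholder file name.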

edited Jan 29 '13 at 15:02

answered Jan 29 '13 at 14:04

Hui Zheng

Would it work with .doc? – hlx Jan 29 '13 at 14:06
As its documentation says, it "Reads, queries and modifies Microsoft Word 2007/2008 docx files". – Hui Zheng Jan 29 '13 at 14:21
.doc is from before 2007, if I'm not mistaken. – hlx Jan 29 '13 at 14:32
If you need to read old-version MS Word files, try antiword. – Hui Zheng Jan 29 '13 at 14:41
OK, I will try with antiword. – hlx Jan 29 '13 at 14:53
