3

I want to create a validation tool.

Can anyone help me read .doc/.docx documents in Python so that I can search and compare the file contents?

HaveNoDisplayName
nalinareka
Possible duplicate of [extracting text from MS word files in python](http://stackoverflow.com/questions/125222/extracting-text-from-ms-word-files-in-python) – Amir Ali Akbari Jan 15 '15 at 07:09

2 Answers

8

Yes, it is possible. LibreOffice (at least) has a command-line option for converting files that works a treat. Use it to convert the document to plain text, then load the resulting text file into Python as you would any other text file.

This worked for me on LibreOffice 4.2 / Linux:

soffice --headless --convert-to txt:Text /path_to/document_to_convert.doc


I've tried a few methods (including odt2txt, antiword, zipfile, lpod, uno). The above soffice command was the first that worked simply and without error. This question on using filters with soffice on ask.libreoffice.org helped me.
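For the Python side, here is a minimal sketch of how the conversion and the comparison could be wired together. It assumes `soffice` is on your PATH and that the converted file lands in the output directory under the document's base name with a `.txt` extension. The helper names (`build_convert_command`, `doc_to_text`, `compare_texts`) are made up for illustration, and the encoding of the converted file may differ on your system, so treat the `utf-8` choice as an assumption.

```python
import difflib
import subprocess
from pathlib import Path


def build_convert_command(doc_path, out_dir):
    """Build the soffice command line that converts one document to plain text."""
    return [
        "soffice", "--headless",
        "--convert-to", "txt:Text",
        "--outdir", str(out_dir),
        str(doc_path),
    ]


def doc_to_text(doc_path, out_dir="."):
    """Convert doc_path with LibreOffice and return the extracted text."""
    subprocess.run(build_convert_command(doc_path, out_dir), check=True)
    # Assumption: the output file is <out_dir>/<document stem>.txt
    txt_path = Path(out_dir) / (Path(doc_path).stem + ".txt")
    # Assumption: utf-8; undecodable bytes are replaced rather than raising
    return txt_path.read_text(encoding="utf-8", errors="replace")


def compare_texts(old_text, new_text):
    """Return unified-diff lines describing how new_text differs from old_text."""
    return list(difflib.unified_diff(
        old_text.splitlines(), new_text.splitlines(),
        fromfile="old", tofile="new", lineterm="",
    ))
```

Something like `compare_texts(doc_to_text("a.doc"), doc_to_text("b.doc"))` would then give you the differing lines, and a plain `"needle" in text` check covers simple searching.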

markling
2

You can try using PyWin32 to access Word via COM, although that will be a little ugly. You could also look at IronPython since it's built with .NET and may have better hooks into Office.
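A rough sketch of the COM route, assuming Windows with Microsoft Word and the pywin32 package installed. The function names are hypothetical, and the path you pass should be absolute because Word resolves relative paths against its own working directory. Word separates paragraphs with carriage returns, so the text is normalised before use.

```python
def normalize_word_text(raw):
    """Convert Word's carriage-return paragraph marks to newlines."""
    return raw.replace("\r", "\n")


def read_word_document(path):
    """Open a document in Word via COM and return its text (Windows only)."""
    # Imported here so the module still loads on machines without pywin32
    import win32com.client

    word = win32com.client.Dispatch("Word.Application")
    word.Visible = False
    try:
        doc = word.Documents.Open(path)
        try:
            return normalize_word_text(doc.Content.Text)
        finally:
            doc.Close(False)  # False: do not save changes
    finally:
        word.Quit()
```

This starts a real Word instance for every call, which is slow, so for many files you would probably want to keep one instance open and reuse it.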

See also the following:

Community
Mike Driscoll