0

I am trying to preform find and replace on docx files, whilst still maintaining formatting.

From reading up on this, the best way seems to be preforming the find/replace on the xml file of the document.

I can load in the xml file and find/replace on it, but unsure how to write it back.

docx:

Hello {text}!

python:

import zipfile

zip = zipfile.ZipFile(open('test.docx', 'rb'))
xmlString = zip.read('word/document.xml').decode('utf-8')
xmlString.replace('{text}', 'world')
user3972986
  • 490
  • 1
  • 5
  • 14

3 Answers3

0

You're really going to want to use a library for reading/writing docx files rather than trying to just deal with them as raw XML. A cursory search came up with the pypi module docx but I haven't used this module so I can't endorse it: https://pypi.python.org/pypi/docx/0.2.4

I've had the (unfortunate) experience of dealing with the manipulation of MS Office documents from other programming languages, and spending the time to find good libraries really paid off.

The old saying goes "don't reinvent the wheel" and I think that's definitely true when manipulating non-trivial file formats. If a somewhat mature library exists to do the job, use it!

davecom
  • 1,499
  • 10
  • 30
  • Thanks for your response. I have used docx before but from my experience it does not carry over the formatting – user3972986 Nov 19 '15 at 13:03
  • There appear to be several other candidate libraries like `python-docx` and `docx_replace` seems especially appropriate to your needs. I would spend the time researching them. – davecom Nov 19 '15 at 13:05
  • python-docx does not carry over the formatting either. I have never tried docx_replace. Will give it a go – user3972986 Nov 19 '15 at 13:16
0

You would need to replace the file in the zip archive. There is no "simple" way of achieving this. The following is a question that should help:

overwriting file in ziparchive

Community
  • 1
  • 1
shahvishal
  • 171
  • 4
0

What you are trying is dangerous, because you are processing a high lever docx file at a lower level. If you really want to do it, just use the hints from overwriting file in ziparchive as suggested by @shahvishal.

But unless you fully know all the details of docx format, my advice is : do not do that. Suppose there is for any reason in an internal field or attribute the string {text}. You are likely to change the file in an unexpected way leading immediately or even worse later to the destruction of the file (Word being no longer able to process it).

If you do your processing on a Windows machine with an installed word, you certainly could try to use automation to process the file with Microsoft Word. Unfortunately, I only did that long time ago and cannot give useful links. You will need:

  • general knowledge on pywin30 module
  • sufficient knowledge on the Automation interface of MS/WORD. Hopefully, its documentation is nice with many examples provided you have a full installation of Microsoft Office including macro help
Community
  • 1
  • 1
Serge Ballesta
  • 143,923
  • 11
  • 122
  • 252