Editing a DOCX file

Question

I am working on a little project that should be quite simple. I know it's been done before, but for the life of me I cannot get it to work. I made a docx template in Microsoft Word that contains a header and some text in the body. My goal is to have a program that can change this text. Using python-docx, I was able to write a program that modifies the body text easily. However, I am trying to learn how to do the same thing by parsing the XML directly, which will also allow the header to be changed. Long story short, XML parsing (I think that's what it is) will give me much more freedom down the road.

I know after the docx is unzipped, the word/document.xml contains the body text. Here is my code so far.

from lxml import etree as ET

tree = ET.parse('document.xml')
root = tree.getroot()

for i in root.iter('{http://schemas.openxmlformats.org/wordprocessingml/2006/main}t'):
    if i.text == 'Title':
        i.text = 'How to cook'

tree.write('document_output.xml', xml_declaration=True, encoding='UTF-8',
           method='xml', standalone=True)

This program successfully changes the wanted text to the updated text.

Here is the original document.xml

https://www.dropbox.com/s/ghe1m176rdqtng7/document.xml?dl=0

Here is the output.

https://www.dropbox.com/s/8n9llagozbvb2mz/document_output.xml?dl=0

P.S. When viewing the files on Dropbox, everything starts at line 4 instead of line 1.

If you view them in an XML viewer, you can see they are identical. Also, if you use a text difference tool, the only difference is the changed word. I wouldn't think this matters, but the XML declaration on the top line of the output uses single quotes instead of double quotes.

Hope someone can shed some light on why this is still not opening properly in Word.

Thanks for all the help!!

The first problem is not a problem: Namespace prefixes only need to be declared if used, and the prefixes themselves are insignificant; as long as the associated URI is the same, then the namespaced elements are equivalent. What's the second problem, if any? Is your created document appearing in Word as expected? — kjhughes, May 24 '16 at 00:05
When I try to open it in Word, it says the file is corrupt. I am assuming that for it to open properly, everything should be the same as in the original XML file, except for the changed text. I can open the XML file in Notepad and edit the text just fine. That works perfectly. I am just trying to get a Python program to do that by XML parsing. — Tyler Bell, May 24 '16 at 00:10
There are many constraints that must be met for a DOCX file to be valid. See, for example, [Where can I find the XSDs of DOCX XML files?](http://stackoverflow.com/questions/36428294/where-can-i-find-the-xsds-of-docx-xml-files). — kjhughes, May 24 '16 at 00:14
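
To illustrate the point in the first comment about prefixes being insignificant, here is a minimal sketch. It uses the standard-library `xml.etree.ElementTree` rather than lxml so that it runs without extra installs, and the two one-element snippets are made up for the illustration; they are not taken from the linked files.

```python
import xml.etree.ElementTree as ET

# Namespace URI used for WordprocessingML elements (same URI as in the question's code).
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical snippets: same namespace URI, two different prefixes (w vs ns0).
xml_w = f'<w:t xmlns:w="{W_NS}">Title</w:t>'
xml_ns0 = f'<ns0:t xmlns:ns0="{W_NS}">Title</ns0:t>'

elem_w = ET.fromstring(xml_w)
elem_ns0 = ET.fromstring(xml_ns0)

# After parsing, the prefix is gone; both tags are stored as "{URI}t".
same_tag = elem_w.tag == elem_ns0.tag
print(elem_w.tag)
print(same_tag)
```

Both elements end up with the same qualified tag, which is why an `ns0:` prefix on its own should not make the XML invalid.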

score -1 · Answer 1 · edited May 23 '17 at 12:31

You're having the usual problems with ET. As a starter, check out these Stack Overflow threads:

As you can see, you're not the first person with these problems.

What you could do for the namespaces is parse the xml twice:

first time in order to extract the namespaces and
a second time in order to do your actual work.
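
The two-pass idea could look roughly like this with the standard-library ElementTree. This is only a sketch: `parse_preserving_prefixes` is a name made up for the example, and the sample document is a tiny hand-written stand-in, not a real `document.xml`.

```python
import io
import xml.etree.ElementTree as ET


def parse_preserving_prefixes(xml_bytes):
    """Hypothetical helper: parse twice so the original prefixes are reused on output."""
    # Pass 1: collect the (prefix, uri) pairs declared in the document.
    namespaces = dict(
        pair for _event, pair in ET.iterparse(io.BytesIO(xml_bytes), events=["start-ns"])
    )
    # Register them so the serializer writes w: instead of ns0:.
    for prefix, uri in namespaces.items():
        ET.register_namespace(prefix, uri)
    # Pass 2: the actual parse that the edits are made on.
    return ET.ElementTree(ET.fromstring(xml_bytes)), namespaces


# Minimal made-up sample, far smaller than a real word/document.xml.
sample = (
    b'<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    b"<w:body><w:p><w:r><w:t>Title</w:t></w:r></w:p></w:body></w:document>"
)

tree, namespaces = parse_preserving_prefixes(sample)
for node in tree.iter("{%s}t" % namespaces["w"]):
    if node.text == "Title":
        node.text = "How to cook"

output = ET.tostring(tree.getroot(), encoding="unicode")
print(output)
```

One caveat with this approach: ElementTree only writes declarations for namespaces that are actually used in element or attribute names, so declarations that appear in the original but are otherwise unused are not carried over. lxml keeps them.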

Besides, some people have already suggested switching from ElementTree to lxml.

edited May 23 '17 at 12:31 by Community

answered May 23 '16 at 22:42 by Michi

I made edits above. I tried the namespace fix and it fixed the ns0 issue. Still getting a corrupt docx message though. – Tyler Bell May 24 '16 at 00:20
Hmm, I can just guess here: you might change 'utf-8' to 'UTF-8', which might but actually shouldn't be a problem. Otherwise, you could provide the document in the before and after state or provide a diff, so it's easier to track down the problem. – Michi May 24 '16 at 10:01
Got one step closer to getting this to work. On my PC the program works fine. I used my mac to create the new document.xml file and I moved it to my PC and re-zipped the contents of the docx file with the new document.xml file. No problems at all. But it doesn't work on my mac. I do the same thing and microsoft word says the file is corrupt. Must be something to do with how a mac and pc compress files. Any ideas? – Tyler Bell May 25 '16 at 03:15
Yes, apparently Mac does some Mac specific stuff with zip files. You can have a look [here](http://sqlblog.com/blogs/john_paul_cook/archive/2011/07/08/windows-and-mac-not-playing-nicely-with-zip-files.aspx) and [here](https://blogs.msdn.microsoft.com/asklar/2012/05/03/why-do-zip-files-from-mac-os-show-up-as-greenencrypted/). – Michi May 25 '16 at 18:34
Just made edits above. I posted before and after document.xml files. Could you take a look and see what you think? I really appreciate your help with this!! – Tyler Bell May 25 '16 at 19:51
Well, apparently, you created the output on your Mac, because the line break is only a linefeed (i.e. `'\n'` or `0x0A`), whereas the original file contains the Microsoft line ending format `'\r\n'` or `0x0D 0x0A`. Anyway, I think your main problem is the zip, as you mentioned in your last comment, rather than the line endings. You can try 7zip, which I heard doesn't have this Mac vs Microsoft problem. But I've never tried that. Please have another look here: http://apple.stackexchange.com/questions/102454/how-can-i-create-a-zip-archive-for-windows-and-linux-users – Michi May 25 '16 at 21:02
Okay, thanks. Is there any way to get it to output using \r\n? I'll look into 7zip, but I believe there is no Mac version. Also, eventually I need the program to do all the zipping anyway. – Tyler Bell May 25 '16 at 21:59
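
Since the last comment mentions having the program do the zipping itself, here is a rough sketch of that step using Python's standard `zipfile` module, which avoids the OS archive tool entirely. The names `rewrite_docx_part` and `retitle` are invented for this example, the `retitle` transform uses the standard-library ElementTree only as a stand-in for the lxml code in the question, and the in-memory archive is a bare-bones fake, not a document Word would open.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def rewrite_docx_part(src, dst, part_name, transform):
    """Hypothetical helper: copy a .docx archive, passing one part through transform()."""
    with zipfile.ZipFile(src) as zin:
        with zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
            for item in zin.infolist():
                data = zin.read(item.filename)
                if item.filename == part_name:
                    data = transform(data)  # bytes in, bytes out
                # Reusing the ZipInfo keeps the entry name and order of the original.
                zout.writestr(item, data)


def retitle(xml_bytes):
    """Stand-in edit: replace the text 'Title' in any w:t element."""
    ET.register_namespace("w", W_NS)
    root = ET.fromstring(xml_bytes)
    for node in root.iter("{%s}t" % W_NS):
        if node.text == "Title":
            node.text = "How to cook"
    return ET.tostring(root, encoding="UTF-8", xml_declaration=True)


# Build a tiny fake archive in memory so the sketch runs without a real .docx.
original = io.BytesIO()
with zipfile.ZipFile(original, "w") as z:
    z.writestr(
        "word/document.xml",
        '<w:document xmlns:w="%s"><w:body><w:p><w:r><w:t>Title</w:t></w:r></w:p>'
        "</w:body></w:document>" % W_NS,
    )
original.seek(0)

edited = io.BytesIO()
rewrite_docx_part(original, edited, "word/document.xml", retitle)

with zipfile.ZipFile(edited) as z:
    result_xml = z.read("word/document.xml").decode("utf-8")
print(result_xml)
```

With real files, `src` and `dst` would be paths such as a template and an output `.docx`, and the same helper could be pointed at a header part instead of `word/document.xml`. Keeping the XML edit in a separate `transform` function means the lxml code from the question can be dropped in unchanged.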
