Write Office Open XML (e.g. docx) with XML that matches the OOXML namespace

Question

I have a Python program that edits the XML in a .docx file. I'd like to edit the XML with ElementTree (ETree).

When I read the XML from the .docx file, it begins like this:

b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\r\n<w:document xmlns:wpc="http://schemas.micro'...

This is in a variable called data. I create the element tree with:

import xml.etree.ElementTree as ElementTree
tree = ElementTree.XML(data)

I convert it back with:

data = ElementTree.tostring(tree)

However, there have been subtle changes to the XML. It now looks like this:

b'<ns0:document xmlns:ns0="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:ns1="ht...

Word won't read this, even though it is standard XML.

EDIT: I tried prepending the original XML declaration to the serialized output, just to get it to round-trip:

XML_HEADER = b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\r\n'
tree = ElementTree.XML(data)
data = XML_HEADER + ElementTree.tostring(tree)

But I still get the error:

We're sorry. We can't open <filename>.docx because we found a problem with its contents.
Details:
The XML data is invalid according to the schema.
Location: Part: /word/document.xml, Line: 0, Column:0

I can't fix Word. I've got to generate XML that looks exactly like the XML that I started with. How do I get ETree to generate that?
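
One way to keep the original prefixes (`w:` instead of `ns0:`) is sketched below. It is a minimal sketch, not a confirmed fix for the Word error: it collects the prefix declarations from the input with `iterparse` and registers each one through `ElementTree.register_namespace` before serializing. The helper names `parse_keeping_prefixes` and `serialize` are hypothetical. Two caveats: `register_namespace` changes a module-wide table, and ElementTree only writes declarations for namespaces that element or attribute names actually use, so a prefix that appears only inside an attribute value (for example in an `mc:Ignorable` list) may still lose its declaration.

```python
import io
import xml.etree.ElementTree as ElementTree

# The declaration from the original file, prepended by hand as in the question.
XML_HEADER = b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\r\n'


def parse_keeping_prefixes(data: bytes) -> ElementTree.Element:
    """Parse XML and register every prefix it declares (hypothetical helper)."""
    # "start-ns" events yield a (prefix, uri) pair per namespace declaration.
    for _event, (prefix, uri) in ElementTree.iterparse(
        io.BytesIO(data), events=["start-ns"]
    ):
        # Module-wide registration: the serializer will reuse this prefix.
        ElementTree.register_namespace(prefix, uri)
    return ElementTree.XML(data)


def serialize(root: ElementTree.Element) -> bytes:
    """Serialize as UTF-8 bytes with the original declaration (hypothetical helper)."""
    # With encoding="utf-8", tostring() adds no declaration of its own.
    return XML_HEADER + ElementTree.tostring(root, encoding="utf-8")
```

Usage would be `root = parse_keeping_prefixes(data)`, then the edits, then `data = serialize(root)`.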

Namespace prefix names and the order of namespace declarations are insignificant. Word won't care. — kjhughes, Oct 04 '18 at 16:27
Try prepending an XML declaration. The `version` and `encoding` values are defaults and shouldn't matter, but the `standalone` value is a little trickier. It's easy enough to prepend the original XML declaration and rule that out as an issue. — kjhughes, Oct 04 '18 at 16:51
@kjhughes, thanks! That's closer. At least I am now able to open the file, although it gives me an error... — vy32, Oct 04 '18 at 18:16
So that's the message you get after you re-zip the re-written document.xml file? Two thoughts: (1) Make sure you're re-zipping properly by re-zipping the original document.xml file as a sanity check. (2) If that works, then you know you can focus on the document.xml file, and I'd start by making sure you've not introduced any content ahead of the XML declaration. See [**here**](https://stackoverflow.com/a/19898942/290085), especially the BOM part. — kjhughes, Oct 04 '18 at 19:51
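
The two checks suggested in the comment above can be sketched as follows, assuming the main part lives at `word/document.xml` (the path shown in the error message). `rewrite_docx` and `bytes_before_declaration` are hypothetical helper names. Calling `rewrite_docx` without replacement data copies every part unchanged, which tests the re-zipping step on its own.

```python
import zipfile

DOCUMENT_PART = "word/document.xml"


def rewrite_docx(src, dst, new_document_xml=None):
    """Copy a .docx archive, optionally replacing word/document.xml.

    src and dst may be paths or binary file objects. With
    new_document_xml=None, every part is copied byte for byte.
    """
    with zipfile.ZipFile(src) as zin, \
            zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
        for info in zin.infolist():
            payload = zin.read(info.filename)
            if info.filename == DOCUMENT_PART and new_document_xml is not None:
                payload = new_document_xml
            # Reusing the ZipInfo keeps the part name, order, and timestamp.
            zout.writestr(info, payload)


def bytes_before_declaration(xml_bytes):
    """Return whatever precedes '<?xml' (a BOM, whitespace, ...)."""
    index = xml_bytes.find(b"<?xml")
    if index < 0:
        raise ValueError("no XML declaration found")
    return xml_bytes[:index]
```

If the unchanged copy opens in Word, the zip handling is fine and attention can move to the rewritten `document.xml`; a non-empty result from `bytes_before_declaration` would point at stray leading content.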
Yes, that's what I get. My code works properly. Right now I'm hacking the XML with regular expressions, which is gross, but it works. My problem is getting the namespace stuff to work. I can post a complete example, if you wish. — vy32, Oct 04 '18 at 21:17
"My code works properly." By that, do you mean the problem has been solved? — kjhughes, Oct 04 '18 at 21:24
"hacking the XML with regular expressions" It's worse than gross; it's unprofessional, non-robust, and bound to fail. And you're dealing with OOXML. You're doomed if you stick with regex. — kjhughes, Oct 04 '18 at 21:25
I mean that my code opens the zip archive, reads the file, butchers the XML with a hatchet (regular expressions), and writes out the ZIP archive, and the file opens in Microsoft Office. However, editing XML with regular expressions is always a bad idea, and I would like to use element tree. And I have problems. For example, Word routinely breaks up words into multiple text runs, and with etree it would be trivial to combine them. It's not trivial with regular expressions. I can't use those fancy docx reader/writers because they don't handle all of the Word properties. — vy32, Oct 04 '18 at 21:26
"I can post a complete example, if you wish." No, unless you take the time to cull a truly minimal [mcve], nobody here is going to want to wade through it. This is looking more like a consulting gig than a Q/A, so I'm going to have to bow out at this point. Good luck. — kjhughes, Oct 04 '18 at 21:26
