
I am attempting to create a script which can extract the XML from a Word document, modify it, and finally save the new Word document, all using Python. Here's the code I used, which was effectively stolen from here:

import zipfile
import os
import tempfile
import shutil


def getXml(docxFilename):
    zip = zipfile.ZipFile(open(docxFilename,"rb"))
    xmlString = str(zip.read("word/document.xml"))
    return xmlString

def createNewDocx(originalDocx,xmlContent,newFilename):
    tmpDir = tempfile.mkdtemp()
    zip = zipfile.ZipFile(open(originalDocx,"rb"))
    zip.extractall(tmpDir)
    with open(os.path.join(tmpDir,"word/document.xml"),"w") as f:
        f.write(xmlContent)
    filenames = zip.namelist()
    zipCopyFilename = newFilename
    with zipfile.ZipFile(zipCopyFilename,"w") as docx:
        for filename in filenames:
            docx.write(os.path.join(tmpDir,filename),filename)
    shutil.rmtree(tmpDir)

One important difference between my code and Virantha's is that he expressed createNewDocx as a class. Unfortunately I don't know what classes are or how they work, so I figured it would be easier to write a function instead.

getXml extracts the XML from a Word document. I tried it out on a test document (named test.docx) and it worked well. In theory, createNewDocx is supposed to take the original docx file (in this case, test.docx) and the modified XML as a string, and create a new Word document named newFilename.

As a test, I ran createNewDocx with the original XML to see if I would get a copy of test.docx. That is, I ran

originalXml = getXml("test.docx")
createNewDocx("test.docx",originalXml,"test2.docx")

This did indeed create a Word document entitled "test2.docx", but when I tried to open the file it just wouldn't open; Word would just crash.

Does anyone know how I can modify my code to make it work?

EDIT: I decided to include originalXml in case there's some problem with how it's formatted.

b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\r\n<w:document xmlns:ve="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml"><w:body><w:p w:rsidR="00000000" w:rsidRDefault="00971B91"><w:r><w:t>You owe me ${debt}. Pay back soon.</w:t></w:r></w:p><w:p w:rsidR="00971B91" w:rsidRPr="00971B91" w:rsidRDefault="00971B91"><w:pPr><w:rPr><w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cs="Times New Roman"/><w:sz w:val="24"/><w:szCs w:val="24"/></w:rPr></w:pPr><w:r><w:rPr><w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cs="Times New Roman"/><w:sz w:val="24"/><w:szCs w:val="24"/></w:rPr><w:t xml:space="preserve">You owe me </w:t></w:r><w:r><w:rPr><w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cs="Times New Roman"/><w:b/><w:sz w:val="24"/><w:szCs w:val="24"/></w:rPr><w:t>${debt}</w:t></w:r><w:r><w:rPr><w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cs="Times New Roman"/><w:sz w:val="24"/><w:szCs w:val="24"/></w:rPr><w:t xml:space="preserve">. 
Pay back </w:t></w:r><w:r><w:rPr><w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cs="Times New Roman"/><w:i/><w:sz w:val="24"/><w:szCs w:val="24"/></w:rPr><w:t>soon.</w:t></w:r></w:p><w:sectPr w:rsidR="00971B91" w:rsidRPr="00971B91"><w:pgSz w:w="12240" w:h="15840"/><w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440" w:header="708" w:footer="708" w:gutter="0"/><w:cols w:space="708"/><w:docGrid w:linePitch="360"/></w:sectPr></w:body></w:document>'

EDIT2: I looked more closely at the XML code above and realized that there was an unusual "b'" at the beginning and a closing quote at the end. I removed these anomalies and ran the code again. Now Word is giving me a more sensible error, namely that there's a problem with "line 1, column 56." That corresponds to the "\r\n" in the XML code above.

So obviously my code isn't extracting the XML properly. Anyone know how to fix this?

Alessandro Power
  • What happens when you try to unzip your generated docx ? – edi9999 Dec 16 '14 at 08:20
  • @edi9999 Nothing unusual. If I run the code `zip = zipfile.ZipFile(open("test2.docx","rb")); filenames = zip.namelist(); print(filenames)` I get the following list: '[Content_Types].xml', '_rels/.rels', 'word/_rels/document.xml.rels', 'word/document.xml', 'word/theme/theme1.xml', 'word/settings.xml', 'word/fontTable.xml', 'word/webSettings.xml', 'docProps/app.xml', 'docProps/core.xml', 'word/styles.xml' – Alessandro Power Dec 16 '14 at 14:31
  • The b'...' around your XML is Python's way of telling you that Python takes it as a buffer, and not a string. If you want to make a string out of a buffer, you need to decode it as in `mystring = mybuffer.decode('utf-8')` (provided it's UTF-8 encoded). – Karel Kubat Dec 17 '14 at 15:37
  • @KarelKubat If I try running your code I get an AttributeError: 'str' object has no attribute 'decode'. Do I have to import a module? Or could we be using different versions of Python? (I'm using Python 3.4) – Alessandro Power Dec 17 '14 at 15:46
  • @AlessandroPower, that means that you're trying to `.decode()` something that's already a string. Which contradicts your example above, in which you have a buffer `b''`. – Karel Kubat Dec 17 '14 at 19:32
  • @KarelKubat I'm not sure what I did wrong before but if I replace `xmlString = str(zip.read("word/document.xml"))` in `getXml` by `xmlString = zip.read("word/document.xml").decode("utf-8")` as per your suggestion, everything works. Thanks a lot, I really appreciate it. I'm still having a few issues though, which I list in [this question](http://stackoverflow.com/questions/27535032/problems-extracting-the-xml-from-a-word-document-in-french-with-python-illegal). Is there any chance you can take a look at it? – Alessandro Power Dec 17 '14 at 21:19

1 Answer

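Calling str() on the result of zip.read("word/document.xml") does not decode the bytes. It returns the printable representation of the bytes object, so the resulting string keeps the leading b', the closing quote, and escape sequences such as \r\n as literal characters.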

def getXml(docxFilename):
    zip = zipfile.ZipFile(open(docxFilename,"rb"))
    xmlString = str(zip.read("word/document.xml"))
    return xmlString

That is also why you got "'str' object has no attribute 'decode'": xmlString is already a string at that point. Remove the str() call and decode the bytes before returning:

def getXml(docxFilename):
    zip = zipfile.ZipFile(open(docxFilename,"rb"))
    # zip.read returns bytes; keep them as bytes until they are decoded
    xmlString = zip.read("word/document.xml")
    return xmlString.decode('utf-8')
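
To see the difference without opening a document, here is a small sketch. The `raw` value is a made-up stand-in for what zip.read returns (shortened from the XML in the question), not the real file content.

```python
# Hypothetical stand-in for the bytes returned by zip.read("word/document.xml").
raw = b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\r\n<w:document/>'

as_repr = str(raw)             # printable representation of the bytes object
as_text = raw.decode("utf-8")  # the actual XML text

print(as_repr[:2])           # b'
print("\\r\\n" in as_repr)   # True: backslash, r, backslash, n as four literal characters
print("\r\n" in as_text)     # True: a real carriage return and line feed

# With the leading b' stripped by hand, the first backslash sits at column 56.
print(as_repr[2:].index("\\") + 1)  # 56
```

The last line is consistent with the "line 1, column 56" error from EDIT2: after removing the b' by hand, the first thing the XML parser hits after the declaration is a literal backslash.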

Hope this is helpful for others!
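
For the writing side, here is a rough sketch of the whole round trip that skips the temporary directory and copies the archive entry by entry. The names `get_xml` and `create_new_docx` are my own renamed versions of the functions in the question, not anything from a library. It also encodes the XML explicitly as UTF-8, on the assumption that opening the file with plain "w" (which uses the platform's default encoding) could cause trouble for non-ASCII text.

```python
import zipfile


def get_xml(docx_filename):
    """Return word/document.xml from a .docx file as text."""
    with zipfile.ZipFile(docx_filename) as archive:
        return archive.read("word/document.xml").decode("utf-8")


def create_new_docx(original_docx, xml_content, new_filename):
    """Copy original_docx to new_filename, swapping in xml_content."""
    with zipfile.ZipFile(original_docx) as src, \
            zipfile.ZipFile(new_filename, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            if item.filename == "word/document.xml":
                # Encode explicitly so the platform default encoding is never used.
                data = xml_content.encode("utf-8")
            else:
                # Every other part is copied through unchanged, as bytes.
                data = src.read(item.filename)
            dst.writestr(item.filename, data)
```

Usage would look like `create_new_docx("test.docx", get_xml("test.docx"), "test2.docx")`. The output path should differ from the input path, since the source archive is still open while the new one is written.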

Anthony