I am attempting to extract the XML code from a Word document with Python. Here's the code I tried:

def getXml(docxFilename):
    zip = zipfile.ZipFile(open(docxFilename,"rb"))
    xmlString= str(zip.read("word/document.xml"))
    return xmlString

I created a test document and ran the function getXml on it. Here's the result:

 b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\r\n<w:document xmlns:ve="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml"><w:body><w:p w:rsidR="00971B91" w:rsidRPr="00971B91" w:rsidRDefault="00B52719"><w:pPr><w:rPr><w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cs="Times New Roman"/><w:sz w:val="24"/><w:szCs w:val="24"/></w:rPr></w:pPr><w:r><w:t>Test</w:t></w:r></w:p><w:sectPr w:rsidR="00971B91" w:rsidRPr="00971B91" w:rsidSect="009C4305"><w:pgSz w:w="12240" w:h="15840"/><w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440" w:header="708" w:footer="708" w:gutter="0"/><w:cols w:space="708"/><w:docGrid w:linePitch="360"/></w:sectPr></w:body></w:document>'

There are some obvious issues. First, the XML begins with "b'" and ends with an apostrophe. Second, there is a literal "\r\n" right after the XML declaration (the first set of angle brackets).

My ultimate goal is to modify the XML code to create a new Word document -- see this question -- but the anomalies with the extracted XML are preventing me from doing this.

Does anyone know why the extracted XML has these strange features and how I can remove them?

EDIT: I tried using the lxml module to parse this code but I only got different errors.

I created a new function getXmlTree:

from lxml import etree

def getXmlTree(xmlString):
    return etree.fromstring(xmlString)

I then ran the code etree.tostring(getXmlTree(getXml("test.docx")),pretty_print=True) and received much more sensible XML code.

The problems arose when I tried to create a new Word document. I wrote the following function to convert XML code into a Word document (shamelessly stolen from here):

import zipfile
from lxml import etree
import os
import tempfile
import shutil

def createNewDocx(originalDocx,xmlContent,newFilename):
    tmpDir = tempfile.mkdtemp()
    zip = zipfile.ZipFile(open(originalDocx,"rb"))
    zip.extractall(tmpDir)
    with open(os.path.join(tmpDir,"word/document.xml"),"w") as f:
        xmlString = etree.tostring(xmlContent,pretty_print=True)
        f.write(xmlString)
    filenames = zip.namelist()
    zipCopyFilename = newFilename
    with zipfile.ZipFile(zipCopyFilename,"w") as docx:
        for filename in filenames:
            docx.write(os.path.join(tmpDir,filename),filename)
    shutil.rmtree(tmpDir)

Before trying to create a new Word document, I wanted to see if I could create a copy of my original test document by substituting xmlContent = getXmlTree(getXml("test.docx")) as an argument in the above function. When I ran the code, however, I received an error message:

f.write(xmlString)

TypeError: must be str, not bytes

Using f.write(str(xmlString)) instead didn't help; it created a new Word document, but Word crashed when I tried to open it.

EDIT2: I tried running the above code with f.write(xmlString.decode("utf-8")) instead, but it didn't help; Word still crashed.

Alessandro Power

1 Answer

My guess is that the XML is not being encoded properly. etree.tostring() returns bytes, which is why writing to a file opened in text mode raises the TypeError, so first open the document file in binary mode ("wb"). Second, tell etree.tostring() which encoding to use and to include the XML declaration:

with open(os.path.join(tmpDir, "word/document.xml"), "wb") as f:
    xmlBytes = etree.tostring(xmlContent, encoding="UTF-8", xml_declaration=True, pretty_print=True)
    f.write(xmlBytes)
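
For illustration only, here is a rough sketch of the same "write bytes, not str" idea applied to the whole archive, without the temporary directory. The names `write_docx_copy` and `xml_bytes` are hypothetical and do not come from the question. It assumes the XML has already been serialized to bytes, for example with the `etree.tostring(...)` call above, and it uses only the standard-library `zipfile` module.

```python
import zipfile


def write_docx_copy(original_docx, xml_bytes, new_filename):
    """Copy original_docx to new_filename, replacing word/document.xml.

    xml_bytes is assumed to be already-serialized XML as bytes
    (hypothetical parameter name, not part of the original code).
    """
    if not isinstance(xml_bytes, bytes):
        # Catch the str/bytes mix-up early instead of producing a broken file.
        raise TypeError("xml_bytes must be bytes, not str")

    with zipfile.ZipFile(original_docx, "r") as src, \
            zipfile.ZipFile(new_filename, "w") as dst:
        for info in src.infolist():
            if info.filename == "word/document.xml":
                data = xml_bytes
            else:
                # Every other member is copied over byte for byte.
                data = src.read(info.filename)
            # Reusing the ZipInfo keeps the member's name, timestamp
            # and compression method.
            dst.writestr(info, data)
```

Because every member is handled as bytes from start to finish, no text-mode file is involved and the str/bytes mismatch cannot occur. Whether Word accepts the result still depends on the XML itself being valid.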
Uyghur Lives Matter
  • Thanks. I didn't follow your exact solution; instead I just replaced `xmlString= str(zip.read("word/document.xml"))` in `getXml` by `xmlString = zip.read("word/document.xml").decode("utf-8")`, which did the trick. I'm still having a few issues though, which are outlined [here](http://stackoverflow.com/questions/27535032/problems-extracting-the-xml-from-a-word-document-in-french-with-python-illegal); I would greatly appreciate it if you could take a look. – Alessandro Power Dec 17 '14 at 21:20
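
A short sketch of why the replacement described in this comment works, assuming Python 3: `ZipFile.read()` returns a bytes object, and calling `str()` on bytes yields its printable representation (the `b'...'` wrapper, with escapes spelled out as backslash sequences) rather than the decoded text. The sample bytes below are a shortened, made-up stand-in for the real document.xml.

```python
# Shortened stand-in for what zip.read("word/document.xml") returns.
raw = b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\r\n<w:t>Test</w:t>'

as_repr = str(raw)             # representation of the bytes object
as_text = raw.decode("utf-8")  # the actual XML text

# str() keeps the b'...' wrapper and spells the line break out as
# the four characters backslash, r, backslash, n.
print(as_repr.startswith("b'"), "\\r\\n" in as_repr)

# decode() gives real text that starts with the XML declaration and
# contains a genuine carriage return plus line feed.
print(as_text.startswith("<?xml"), "\r\n" in as_text)
```

So the "b'" prefix, the trailing apostrophe and the visible "\r\n" in the question are all artifacts of `str()`, not part of the XML stored in the .docx file.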