I am attempting to extract the XML code from a Word document with Python. Here's the code I tried:
def getXml(docxFilename):
    zip = zipfile.ZipFile(open(docxFilename,"rb"))
    xmlString= str(zip.read("word/document.xml"))
    return xmlString
I created a test document and ran the function getXml on it. Here's the result:
b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\r\n<w:document xmlns:ve="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml"><w:body><w:p w:rsidR="00971B91" w:rsidRPr="00971B91" w:rsidRDefault="00B52719"><w:pPr><w:rPr><w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cs="Times New Roman"/><w:sz w:val="24"/><w:szCs w:val="24"/></w:rPr></w:pPr><w:r><w:t>Test</w:t></w:r></w:p><w:sectPr w:rsidR="00971B91" w:rsidRPr="00971B91" w:rsidSect="009C4305"><w:pgSz w:w="12240" w:h="15840"/><w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440" w:header="708" w:footer="708" w:gutter="0"/><w:cols w:space="708"/><w:docGrid w:linePitch="360"/></w:sectPr></w:body></w:document>'
There are some obvious issues. First, the XML begins with "b'" and ends with an apostrophe. Second, there is a literal "\r\n" right after the first set of angle brackets.
My ultimate goal is to modify the XML code to create a new Word document -- see this question -- but the anomalies with the extracted XML are preventing me from doing this.
Does anyone know why the extracted XML has these strange features and how I can remove them?
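A likely explanation, assuming Python 3: zip.read returns a bytes object, and calling str() on bytes yields its printable representation (the b'...' wrapper with an escaped \r\n) rather than decoded text. The sketch below uses a small in-memory archive as a stand-in for a real .docx; get_xml and the sample content are hypothetical names used only for illustration.

```python
import io
import zipfile

def get_xml(docx_file):
    """Return word/document.xml from a .docx path or file-like object as text."""
    with zipfile.ZipFile(docx_file) as archive:
        raw = archive.read("word/document.xml")  # bytes in Python 3
    # Decode the bytes instead of wrapping them in str().
    return raw.decode("utf-8")

# Stand-in archive built in memory (not a real Word document).
sample = b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\r\n<doc>Test</doc>'
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", sample)
buffer.seek(0)

text = get_xml(buffer)   # decoded text, starts with "<?xml"
wrapped = str(sample)    # what str() on the bytes produces: "b'<?xml ..."
```

If the goal is parsing rather than display, lxml's etree.fromstring accepts the raw bytes directly, so the decode step can be skipped there.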
EDIT: I tried using the lxml module to parse this code but I only got different errors.
I created a new function getXmlTree:
from lxml import etree

def getXmlTree(xmlString):
    return etree.fromstring(xmlString)
I then ran the code etree.tostring(getXmlTree(getXml("test.docx")),pretty_print=True) and received much more sensible XML code.
The problems arose when I tried to create a new Word document. I created the following function to convert XML code into a Word document (shamelessly stolen from here):
import zipfile
from lxml import etree
import os
import tempfile
import shutil

def createNewDocx(originalDocx,xmlContent,newFilename):
    tmpDir = tempfile.mkdtemp()
    zip = zipfile.ZipFile(open(originalDocx,"rb"))
    zip.extractall(tmpDir)
    with open(os.path.join(tmpDir,"word/document.xml"),"w") as f:
        xmlString = etree.tostring(xmlContent,pretty_print=True)
        f.write(xmlString)
    filenames = zip.namelist()
    zipCopyFilename = newFilename
    with zipfile.ZipFile(zipCopyFilename,"w") as docx:
        for filename in filenames:
            docx.write(os.path.join(tmpDir,filename),filename)
    shutil.rmtree(tmpDir)
Before trying to create a new Word document, I wanted to see if I could create a copy of my original test document by substituting xmlContent = getXmlTree(getXml("test.docx")) as an argument in the above function. When I ran the code, however, I received an error message:
f.write(xmlString)
TypeError: must be str, not bytes
Using f.write(str(xmlString)) instead didn't help; it created a new Word document, but Word crashed when I tried to open it.
EDIT2: I tried running the above code with f.write(xmlString.decode("utf-8")) instead, but it didn't help; Word still crashed.
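One possible way around the TypeError, sketched under the assumption that the XML has already been serialized to bytes (for example with etree.tostring(tree, xml_declaration=True, encoding="UTF-8", standalone=True)): copy every archive member straight into the new file and write the bytes unchanged, with no text-mode file in between. The names create_new_docx, source_buffer and target_buffer are hypothetical, the archive is an in-memory stand-in, and this has not been verified against Word itself.

```python
import io
import zipfile

def create_new_docx(original_docx, xml_bytes, new_filename):
    """Copy original_docx to new_filename, replacing word/document.xml with xml_bytes."""
    with zipfile.ZipFile(original_docx) as source, \
         zipfile.ZipFile(new_filename, "w", zipfile.ZIP_DEFLATED) as target:
        for name in source.namelist():
            if name == "word/document.xml":
                data = xml_bytes          # new content, written as raw bytes
            else:
                data = source.read(name)  # every other member is copied unchanged
            target.writestr(name, data)

# Stand-in archive built in memory (not a real Word document).
source_buffer = io.BytesIO()
with zipfile.ZipFile(source_buffer, "w") as archive:
    archive.writestr("[Content_Types].xml", b"<Types/>")
    archive.writestr("word/document.xml", b"<old/>")
source_buffer.seek(0)

target_buffer = io.BytesIO()
create_new_docx(source_buffer, b"<new/>", target_buffer)
target_buffer.seek(0)

with zipfile.ZipFile(target_buffer) as result:
    names = result.namelist()
    new_xml = result.read("word/document.xml")
    other = result.read("[Content_Types].xml")
```

Working on the archive directly also avoids the temporary directory, so there is no text-mode file where newline translation or a str() wrapper could alter the XML.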