.docx is the file extension for files created using the default format of Microsoft Word 2007 or later. Use this tag when working with .docx files programmatically, such as generating them, extracting data from them, or editing them.
.docx is the file extension for files created using the default format of Microsoft Word 2007 or later. This is the Microsoft Office Open XML WordprocessingML format, which is based on a zipped collection of eXtensible Markup Language (XML) files. It is largely standardized in ECMA-376 and ISO/IEC 29500.
Formerly, Microsoft Office used proprietary binary formats (.xls, .doc, .ppt); BIFF (Binary Interchange File Format) is the name of the one used by Excel. Office now uses the OOXML (Office Open XML) formats. These files (.xlsx, .xlsm, .docx, .docm, .pptx, .pptm) are zipped XML.
.docx is the default Word format since Word 2007; it cannot contain any VBA (for security reasons, as stated by Microsoft).
.docm is the macro-enabled Word format, which can store VBA and execute macros.
A .docx file is a ZIP archive that typically contains the following folders and files:
+--docProps
|  +  app.xml
|  \  core.xml
+  res.log
+--word //this folder contains most of the files that control the content of the document
|  +  document.xml //the actual content of the document
|  +  endnotes.xml
|  +  fontTable.xml
|  +  footer1.xml //contains the elements in the footer of the document
|  +  footnotes.xml
|  +--media //this folder contains all images embedded in the Word document
|  |  \  image1.jpeg
|  +  settings.xml
|  +  styles.xml
|  +  stylesWithEffects.xml
|  +--theme
|  |  \  theme1.xml
|  +  webSettings.xml
|  \--_rels
|     \  document.xml.rels //this file tells Word where the images are located
+  [Content_Types].xml
\--_rels
   \  .rels
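Because the container is an ordinary ZIP archive, its parts can be inspected with any ZIP library. The following is a minimal sketch using Python's standard zipfile module. The helper names (build_sample_docx, list_parts, read_part) are hypothetical, and the archive built here is a tiny stand-in with placeholder XML, not a file that Word would open.

```python
import io
import zipfile
from typing import List


def build_sample_docx() -> bytes:
    """Build a tiny stand-in archive in memory (assumption: placeholder parts only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("[Content_Types].xml", "<Types/>")
        archive.writestr("_rels/.rels", "<Relationships/>")
        archive.writestr("word/document.xml", "<w:document/>")
        archive.writestr("word/_rels/document.xml.rels", "<Relationships/>")
    return buffer.getvalue()


def list_parts(docx_bytes: bytes) -> List[str]:
    """Return the names of all parts stored in the archive, sorted."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        return sorted(archive.namelist())


def read_part(docx_bytes: bytes, name: str) -> str:
    """Return one part of the archive decoded as UTF-8 text."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        return archive.read(name).decode("utf-8")


sample = build_sample_docx()
for part in list_parts(sample):
    print(part)
print(read_part(sample, "word/document.xml"))
```

For a real document on disk, zipfile.ZipFile("some-file.docx") can be opened directly in the same way; the in-memory buffer is only used here to keep the sketch self-contained.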
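The main content of a .docx file resides in word/document.xml.
A typical word/document.xml looks like this (only the w:body element is shown; in the actual file it is wrapped in a w:document root element that declares the XML namespaces):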
<w:body>
  <w:p w:rsidR="001A6335" w:rsidRPr="0059122C" w:rsidRDefault="0059122C" w:rsidP="0059122C">
    <w:r>
      <w:t>Hello </w:t>
    </w:r>
    <w:proofErr w:type="spellStart"/>
    <w:r w:rsidR="008B4316">
      <w:t>W</w:t>
    </w:r>
    <w:proofErr w:type="spellEnd"/>
    <w:r>
      <w:t>orld</w:t>
    </w:r>
    <w:bookmarkStart w:id="0" w:name="_GoBack"/>
    <w:bookmarkEnd w:id="0"/>
  </w:p>
  <w:sectPr w:rsidR="001A6335" w:rsidRPr="0059122C" w:rsidSect="001A6335">
    <w:headerReference w:type="default" r:id="rId7"/>
    <w:footerReference w:type="default" r:id="rId8"/>
    <w:pgSz w:w="12240" w:h="15840"/>
    <w:pgMar w:top="1440" w:right="1800" w:bottom="1440" w:left="1800" w:header="720" w:footer="720" w:gutter="0"/>
    <w:cols w:space="720"/>
    <w:docGrid w:linePitch="360"/>
  </w:sectPr>
</w:body>
The w:body element holds the content of the whole document, which is divided into w:p elements (paragraphs). It ends with a w:sectPr element, which defines the section properties: the page size, the margins, and the headers/footers used for that document.
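As a small worked example, the page size in w:pgSz can be converted to inches. This sketch assumes the usual WordprocessingML unit for these attributes, twentieths of a point (1440 per inch), and the standard WordprocessingML main namespace URI; the function name page_size_inches is hypothetical.

```python
import xml.etree.ElementTree as ET

# Standard WordprocessingML main namespace.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
# 20 twentieths-of-a-point per point, 72 points per inch.
TWIPS_PER_INCH = 1440

# Reduced w:sectPr holding only the page size from the sample document.
SECT_PR = (
    f'<w:sectPr xmlns:w="{W_NS}">'
    '<w:pgSz w:w="12240" w:h="15840"/>'
    "</w:sectPr>"
)


def page_size_inches(sect_pr_xml):
    """Return (width, height) of the page in inches from a w:sectPr fragment."""
    root = ET.fromstring(sect_pr_xml)
    pg_sz = root.find(f"{{{W_NS}}}pgSz")
    # Attributes are namespace-qualified, so they are looked up as "{ns}w" and "{ns}h".
    width = int(pg_sz.get(f"{{{W_NS}}}w")) / TWIPS_PER_INCH
    height = int(pg_sz.get(f"{{{W_NS}}}h")) / TWIPS_PER_INCH
    return width, height


print(page_size_inches(SECT_PR))  # (8.5, 11.0)
```

The values 12240 and 15840 therefore correspond to an 8.5 x 11 inch page, that is, US Letter.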
A w:p contains one or more w:r elements (runs). Each run can carry its own formatting (text color, font size, ...) in a w:rPr child element, and contains one or more w:t elements (text parts), usually just one.
As you can see, a simple sentence like Hello World may be split across multiple w:t elements, which makes templating quite difficult to implement.
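One common way around this when only reading text is to join all w:t fragments of a paragraph before searching or matching. The following is a minimal sketch using Python's standard xml.etree.ElementTree; the function name paragraph_texts and the reduced sample XML are illustrative assumptions. It ignores tabs, line breaks, and field codes, and a paragraph nested inside another paragraph (for example in a text box) would be counted twice.

```python
import xml.etree.ElementTree as ET

# Standard WordprocessingML main namespace.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}

# Reduced version of the sample document: one paragraph split over three runs.
DOCUMENT_XML = f"""<w:document xmlns:w="{W_NS}">
  <w:body>
    <w:p>
      <w:r><w:t>Hello </w:t></w:r>
      <w:r><w:t>W</w:t></w:r>
      <w:r><w:t>orld</w:t></w:r>
    </w:p>
  </w:body>
</w:document>"""


def paragraph_texts(document_xml):
    """Return one string per w:p, concatenating every w:t found inside it."""
    root = ET.fromstring(document_xml)
    paragraphs = []
    for paragraph in root.iterfind(".//w:p", NS):
        pieces = [t.text or "" for t in paragraph.iterfind(".//w:t", NS)]
        paragraphs.append("".join(pieces))
    return paragraphs


print(paragraph_texts(DOCUMENT_XML))  # ['Hello World']
```

Joining fragments is enough for extraction. Replacing text in place is harder, because the replacement has to be written back into specific runs without losing their formatting.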