Questions tagged [docx]

.docx is the file extension for files created using the default format of Microsoft Word 2007 or later. Use this tag when you are working with .docx files programmatically, such as generating a .docx file, extracting data from one, or editing one.

The .docx format is Microsoft's Office Open XML WordprocessingML format: a zipped collection of XML (eXtensible Markup Language) files. It is mostly standardized in ECMA-376 and ISO/IEC 29500.

Formerly, Microsoft Office used proprietary binary formats (.xls, .doc, .ppt); BIFF (Binary Interchange File Format), for example, is the binary format of .xls workbooks. Office now uses the OOXML (Office Open XML) formats, whose files (.xlsx, .xlsm, .docx, .docm, .pptx, .pptm) are zipped XML.

.docx is the default Word format since Word 2007. It cannot contain any VBA (for security reasons, as stated by Microsoft).
.docm is the macro-enabled Word format, which can store VBA and execute macros.
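
As a rough illustration of the .docx/.docm difference, the sketch below checks whether a package contains a part named vbaProject.bin, which is where Word typically stores the VBA project (usually at word/vbaProject.bin). This is a heuristic under that assumption, not a definitive macro check; the names has_vba_project and make_package are hypothetical helpers introduced for this example.

```python
import io
import zipfile


def has_vba_project(docx_file) -> bool:
    """Return True if the zip package contains a part named vbaProject.bin.

    Heuristic only: assumes the VBA project uses Word's usual part name.
    """
    with zipfile.ZipFile(docx_file) as package:
        return any(
            name.lower().endswith("vbaproject.bin")
            for name in package.namelist()
        )


def make_package(part_names):
    """Build a tiny in-memory zip with empty parts (demonstration only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        for name in part_names:
            package.writestr(name, b"")
    buffer.seek(0)
    return buffer


plain = make_package(["word/document.xml"])
macro = make_package(["word/document.xml", "word/vbaProject.bin"])
print(has_vba_project(plain), has_vba_project(macro))
```

The function accepts a path or a file-like object, because zipfile.ZipFile accepts both.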

A .docx file is a zip archive that contains the following files and folders:

+--docProps
|  +  app.xml
|  \  core.xml
+  res.log
+--word //This folder contains most of the files that control the content of the document
|  +  document.xml //The actual content of the document
|  +  endnotes.xml
|  +  fontTable.xml
|  +  footer1.xml //Contains the elements in the footer of the document
|  +  footnotes.xml
|  +--media //This folder contains all images embedded in the Word document
|  |  \  image1.jpeg
|  +  settings.xml
|  +  styles.xml
|  +  stylesWithEffects.xml
|  +--theme
|  |  \  theme1.xml
|  +  webSettings.xml
|  \--_rels
|     \  document.xml.rels //This file tells Word where the images are located
+  [Content_Types].xml
\--_rels
   \  .rels
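
To illustrate how word/_rels/document.xml.rels points at the images, here is a minimal sketch that maps relationship ids to image targets. The sample XML and the function name image_targets are made up for this example; the targets are relative to the word folder.

```python
import xml.etree.ElementTree as ET

# Namespace of the relationships part and the relationship type used for images.
PKG_REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
IMAGE_REL_TYPE = (
    "http://schemas.openxmlformats.org/officeDocument/2006/relationships/image"
)

# Hypothetical content of word/_rels/document.xml.rels, shortened.
SAMPLE_RELS = (
    f'<Relationships xmlns="{PKG_REL_NS}">'
    '<Relationship Id="rId1" '
    'Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles" '
    'Target="styles.xml"/>'
    f'<Relationship Id="rId4" Type="{IMAGE_REL_TYPE}" Target="media/image1.jpeg"/>'
    "</Relationships>"
)


def image_targets(rels_xml: str) -> dict:
    """Map each image relationship id to its target path."""
    root = ET.fromstring(rels_xml)
    targets = {}
    for rel in root.findall(f"{{{PKG_REL_NS}}}Relationship"):
        if rel.get("Type") == IMAGE_REL_TYPE:
            targets[rel.get("Id")] = rel.get("Target")
    return targets


print(image_targets(SAMPLE_RELS))
```

Inside document.xml, an image is referenced by its relationship id (here rId4), which this mapping resolves to a file in the package.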

The main content of a docx file resides in word/document.xml.
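
As a minimal sketch, the main part can be read with Python's standard zipfile module. It assumes the main part sits at word/document.xml, as Word writes it, and that it is UTF-8 encoded; read_main_document and the in-memory demo package are hypothetical and only there to keep the example self-contained.

```python
import io
import zipfile


def read_main_document(docx_file) -> str:
    """Return the raw XML of word/document.xml from a .docx package."""
    with zipfile.ZipFile(docx_file) as package:
        # Assumption: UTF-8, which is what Word normally writes.
        return package.read("word/document.xml").decode("utf-8")


# Build a tiny stand-in package in memory instead of relying on a file on disk.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as package:
    package.writestr("word/document.xml", "<w:document/>")
demo.seek(0)

print(read_main_document(demo))
```

With a real file, passing its path (for example read_main_document("report.docx"), a hypothetical file name) works the same way.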

A typical word/document.xml looks like this (the enclosing w:document root element and its namespace declarations are omitted):

<w:body>
  <w:p w:rsidR="001A6335" w:rsidRPr="0059122C" w:rsidRDefault="0059122C" w:rsidP="0059122C">
    <w:r>
      <w:t>Hello </w:t>
    </w:r>
    <w:proofErr w:type="spellStart"/>
    <w:r w:rsidR="008B4316">
      <w:t>W</w:t>
    </w:r>
    <w:proofErr w:type="spellEnd"/>
    <w:r>
      <w:t>orld</w:t>
    </w:r>
    <w:bookmarkStart w:id="0" w:name="_GoBack"/>
    <w:bookmarkEnd w:id="0"/>
  </w:p>
  <w:sectPr w:rsidR="001A6335" w:rsidRPr="0059122C" w:rsidSect="001A6335">
    <w:headerReference w:type="default" r:id="rId7"/>
    <w:footerReference w:type="default" r:id="rId8"/>
    <w:pgSz w:w="12240" w:h="15840"/>
    <w:pgMar w:top="1440" w:right="1800" w:bottom="1440" w:left="1800" w:header="720" w:footer="720" w:gutter="0"/>
    <w:cols w:space="720"/>
    <w:docGrid w:linePitch="360"/>
  </w:sectPr>
</w:body>

The outermost tag shown is w:body (for the whole document). The body is divided into multiple w:p elements (paragraphs), followed by a w:sectPr element, which defines the section properties: the page size, the margins, and the headers/footers used for that document.
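
The sizes in w:pgSz and w:pgMar are expressed in twentieths of a point (1440 per inch). A small worked conversion, using the values from the sample above (the helper name twips_to_inches is hypothetical):

```python
TWIPS_PER_INCH = 1440  # 20 units per point, 72 points per inch


def twips_to_inches(twips: int) -> float:
    """Convert twentieths of a point to inches."""
    return twips / TWIPS_PER_INCH


# Values taken from the w:pgSz and w:pgMar elements in the sample.
page_width = twips_to_inches(12240)   # 8.5
page_height = twips_to_inches(15840)  # 11.0
side_margin = twips_to_inches(1800)   # 1.25
print(page_width, page_height, side_margin)
```

So the sample describes a US Letter page (8.5 x 11 inches) with 1.25 inch left and right margins.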

Inside a w:p, there are multiple w:r elements (runs). Every run can define its own formatting (color of the text, font size, ...) and contains one or more w:t elements (text parts).

As you can see, a simple sentence like Hello World might be split across multiple w:r and w:t elements, which makes templating quite difficult to implement.
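
One common way around the split runs, when only the text matters, is to join every w:t inside a paragraph. The following is a minimal sketch using the standard library; the sample XML is a shortened version of the document above with the namespace declaration added, and paragraph_texts is a hypothetical helper. It ignores tabs, line breaks, and nested structures such as text boxes.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

SAMPLE_DOCUMENT = (
    f'<w:document xmlns:w="{W_NS}"><w:body><w:p>'
    "<w:r><w:t>Hello </w:t></w:r>"
    '<w:proofErr w:type="spellStart"/>'
    "<w:r><w:t>W</w:t></w:r>"
    '<w:proofErr w:type="spellEnd"/>'
    "<w:r><w:t>orld</w:t></w:r>"
    "</w:p></w:body></w:document>"
)


def paragraph_texts(document_xml: str) -> list[str]:
    """Return the text of each w:p, joining the w:t pieces of all its runs."""
    root = ET.fromstring(document_xml)
    paragraphs = []
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        pieces = (t.text or "" for t in paragraph.iter(f"{{{W_NS}}}t"))
        paragraphs.append("".join(pieces))
    return paragraphs


print(paragraph_texts(SAMPLE_DOCUMENT))  # ['Hello World']
```

Reading the joined text is straightforward; writing a replacement back is the hard part for templating, because the new text has to be redistributed over the existing runs.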

3020 questions
113
votes
16 answers

Is there a Java API that can create rich Word documents?

I have a new app I'll be working on where I have to generate a Word document that contains tables, graphs, a table of contents and text. What's a good API to use for this? How sure are you that it supports graphs, ToCs, and tables? What are some…
billjamesdev
83
votes
10 answers

How can I search a word in a Word 2007 .docx file?

I'd like to search a Word 2007 file (.docx) for a text string, e.g., "some special phrase" that could/would be found from a search within Word. Is there a way from Python to see the text? I have no interest in formatting - I just want to classify…
Gerry
77
votes
4 answers

Markdown to docx, including complex template

I have automated my build to convert Markdown files to DOCX files using Pandoc. I have even used a reference document for the final document's styling. The command I use is: pandoc -f markdown -t docx --data-dir=docs/rendering/ mydoc.md -o…
Synesso
59
votes
6 answers

How to extract just plain text from .doc & .docx files?

Anyone know of anything they can recommend in order to extract just the plain text from a .doc or .docx? I've found this - wondered if there were any other suggestions?
docextract
56
votes
5 answers

How to extract text from word file .doc,docx,.xlsx,.pptx php

There may be a scenario where we need to get the text from Word documents for future use, such as searching for a string in a document uploaded by a user (for example, searching CVs/resumes), and a common problem is how to get the text. Open and read a…
M Khalid Junaid
40
votes
7 answers

Version control for DOCX and PDF?

I've been playing around with git and hg lately and then suddenly it occurred to me that this kind of thing will be great for documents. I've a document which I edit in DOCX and export as PDF. I tried using both git and hg to version control it and…
Jungle Hunter
37
votes
4 answers

How to zip a WordprocessingML folder into readable docx

I have been trying to write a simple Markdown -> docx parser/writer, but am completely stuck with the last part, which should be the easiest: i.e. compressing the folder into a .docx that Word, or any other .docx reader, will recognize. My…
Michael
36
votes
5 answers

Inserting Image into DocX using OpenXML and setting the size

I am using OpenXML to insert an image into my document. The code provided by Microsoft works, but makes the image much smaller: public static void InsertAPicture(string document, string fileName) { using (WordprocessingDocument…
LunchMarble
36
votes
6 answers

Knitr & Rmarkdown docx tables

When using knitr and rmarkdown together to create a word document you can use an existing document to style the output. For example in my yaml header: output: word_document: reference_docx: style.docx fig_caption: TRUE within this style…
zacdav
36
votes
2 answers

How can I create a simple docx file with Apache POI?

I'm searching for a simple example code or a complete tutorial how to create a docx file with Apache POI and its underlying openxml4j. I tried the following code (with a lot of help from the Content Assist, thanks Eclipse!) but the code does not…
guerda
31
votes
3 answers

Chrome says: "Resource interpreted as Document but transferred with MIME type application/vnd.openxmlformats-officedocument.wordprocessingml.document"

I am offering a file for download from my site, which is working. However, I am noticing this behavior from Chrome. I think I have the correct MIME Type set but Chrome is showing this message and also marks the request in red. The MIME type I have…
Michael
29
votes
6 answers

Converting docx to pdf with pure python (on linux, without libreoffice)

I'm dealing with a problem trying to develop a web-app, part of which converts uploaded docx files to pdf files (after some processing). With python-docx and other methods, I do not require a windows machine with word installed, or even libreoffice…
Ofer Sadan
29
votes
8 answers

Append multiple DOCX files together

I need to use C# programmatically to append several preexisting docx files into a single, long docx file - including special markups like bullets and images. Header and footer information will be stripped out, so those won't be around to cause any…
ShootTheCore
28
votes
8 answers

Add styling rules in pandoc tables for odt/docx output (table borders)

I'm generating some odt/docx reports via markdown using knitr and pandoc and am now wondering how you'd go about formatting tables. Primarily I'm interested in adding rules (at least top, bottom and one below the header, but being able to add…
Tilo Wiklund
27
votes
9 answers

Why are .docx files being corrupted when downloading from an ASP.NET page?

I have this following code for bringing page attachments to the user: private void GetFile(string package, string filename) { var stream = new MemoryStream(); try { using (ZipFile zip = ZipFile.Read(package)) { …
Victor Rodrigues