37

I have been trying to write a simple Markdown -> docx parser/writer, but I am completely stuck on the last part, which should be the easiest: compressing the folder into a .docx that Word, or any other .docx reader, will recognize.

My parser/writer is really irrelevant: I have the same problem if I simply unzip any Word-produced *.docx and then recompress it with the usual compression utilities, giving it the .docx extension. Is there some mysterious header I should be adding, or do I need a special OPC compression utility, or what?

I don't so much want a tool that will do this as to figure out what is supposed to be there. It seems to be independent of the WordprocessingML specification.

Needless to say, I don't know anything about compression. Everything I can find via Google has to do with fancy utilities you can use in business, but I'm making a little executable that would be GPL'd or something, and should work on anything.

Camille G.
Michael
  • Eric White is exactly right in inferring that I experienced "the most common problem around manually zipping an Open XML document". The error is already visible in the title of the question: I was compressing a folder containing the material, rather than adding the materials individually to a zip file. It occurs to me that I might have guessed this, since if you unzip a .docx file you emphatically don't get a tidy little directory, but files scattered all over the directory you're working in. Thanks! Of course, this means I should get back to the project I mention above ... :) – applicative Apr 24 '11 at 22:21

4 Answers

54

The most common problem when manually zipping Open XML documents is zipping the directory instead of its contents, which produces a file that Word will not open. In other words, the [Content_Types].xml file and the word, docProps, and _rels directories need to reside at the root level of the zip file.
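
As a rough way to check for this mistake, the sketch below looks inside an archive and reports where [Content_Types].xml actually sits. It uses only Python's standard zipfile module; the function name package_root is hypothetical, chosen for illustration.

```python
import zipfile

CONTENT_TYPES = "[Content_Types].xml"

def package_root(docx_path):
    """Report where [Content_Types].xml sits inside the archive.

    Returns "" if it is at the archive root (the layout Word expects),
    the wrapping folder prefix such as "mydoc/" if the folder itself
    was zipped, or None if the file is missing entirely.
    """
    with zipfile.ZipFile(docx_path) as zf:
        names = zf.namelist()
    if CONTENT_TYPES in names:
        return ""
    for name in names:
        if name.endswith("/" + CONTENT_TYPES):
            # Strip the file name, leaving only the folder prefix.
            return name[: -len(CONTENT_TYPES)]
    return None
```

A non-empty return value suggests the archive was built from the folder rather than from its contents.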

Eric White
  • Hi, I am the original poster, but I lost this S.O. account, else I would mark this as the 'right answer'. You are right that my mistake was to zip the directory that included all the material, thinking I needed the right incantation, form of compression ... some subtlety. MS Word is quite willing to open the file if I add all the relevant files to a single zip file (including wholesale addition of subdirectories like `word`, which must themselves sit at the root level). So far I have tried this on OS X without incident. I will study things more. – applicative Apr 24 '11 at 22:13
  • Truly open: self-made .docx files zipped with WinZip or WinRAR are all readable! – Lei Yang Nov 15 '13 at 09:45
23

Here are the steps to unzip my.docx and re-zip it from inside the unpacked folder, so that its contents land at the root of the archive:

% mkdir unzipped
% cd unzipped/
% unzip ../my.docx
% zip -r ../rezipped.docx *
% open ../rezipped.docx
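
For a tool that should not depend on the zip and unzip commands being installed, roughly the same round trip can be sketched with Python's standard library. The function name rezip_docx is hypothetical; passing root_dir to shutil.make_archive makes it archive the folder's contents rather than the folder itself.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path

def rezip_docx(src_docx, dst_docx):
    """Unpack src_docx into a temporary folder and repack its contents."""
    with tempfile.TemporaryDirectory() as tmp:
        unpacked = Path(tmp) / "unzipped"
        unpacked.mkdir()
        with zipfile.ZipFile(src_docx) as zf:
            zf.extractall(unpacked)
        # root_dir makes entry names relative to the unpacked folder, so
        # [Content_Types].xml ends up at the root of the new archive.
        archive = shutil.make_archive(
            str(Path(tmp) / "rezipped"), "zip", root_dir=str(unpacked)
        )
        # make_archive appends ".zip"; move the result to the requested name.
        shutil.move(archive, str(dst_docx))
```

Calling rezip_docx("my.docx", "rezipped.docx") mirrors the shell steps above, minus opening the result.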
Sam Barnum
3

A .docx file is an ordinary ZIP archive, with entries typically compressed using DEFLATE. No Base64 encoding is involved.

7-Zip seems to offer this, though I have not tested it.
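
To see which compression method an existing file uses, a short sketch with Python's zipfile module can list the method recorded for each entry. The helper name compression_methods is hypothetical; the expectation that a Word-produced file reports "deflate" (or "stored" for some entries) is an assumption worth checking against your own files.

```python
import zipfile

# Map the numeric method stored in each ZIP entry to a readable name.
METHOD_NAMES = {
    zipfile.ZIP_STORED: "stored",
    zipfile.ZIP_DEFLATED: "deflate",
    zipfile.ZIP_BZIP2: "bzip2",
    zipfile.ZIP_LZMA: "lzma",
}

def compression_methods(docx_path):
    """Return {entry name: compression method name} for a .docx file."""
    if not zipfile.is_zipfile(docx_path):
        raise ValueError(f"{docx_path} is not a ZIP archive")
    with zipfile.ZipFile(docx_path) as zf:
        return {
            info.filename: METHOD_NAMES.get(info.compress_type, str(info.compress_type))
            for info in zf.infolist()
        }
```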

Mica
3

Further to what Mica said, the contents of the ZIP file are organised according to the Open Packaging Conventions (OPC); cf. Microsoft's "Essentials of the Open Packaging Conventions".

You can use the .NET System.IO.Packaging namespace to create and manipulate .docx files; this namespace is also implemented in the Mono project.
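
Outside .NET, the package layout can also be inspected directly. The sketch below, using only Python's standard library, reads [Content_Types].xml from the archive root and lists the content types it declares. The function name list_content_types is hypothetical, and this illustrates just one piece of the OPC layout, not a full validator.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespace used by the OPC content-types part.
CT_NS = "{http://schemas.openxmlformats.org/package/2006/content-types}"

def list_content_types(docx_path):
    """Return (defaults, overrides) declared in [Content_Types].xml.

    defaults maps a file extension to a content type;
    overrides maps a specific part name to a content type.
    """
    with zipfile.ZipFile(docx_path) as zf:
        root = ET.fromstring(zf.read("[Content_Types].xml"))
    defaults = {
        el.get("Extension"): el.get("ContentType")
        for el in root.iter(CT_NS + "Default")
    }
    overrides = {
        el.get("PartName"): el.get("ContentType")
        for el in root.iter(CT_NS + "Override")
    }
    return defaults, overrides
```

If the archive was zipped with a wrapping folder, zf.read raises KeyError here, which is another quick sign that [Content_Types].xml is not at the root.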

Charles Stewart