40

I've been playing around with git and hg lately, and it occurred to me that this kind of thing would be great for documents.

I have a document which I edit as DOCX and export as PDF. I tried using both git and hg to version control it, and it turns out that with hg you end up tracking only a binary, so diffing isn't meaningful. With git I can meaningfully diff DOCX (I haven't tried PDF yet), but I was wondering if there is a better way to do it than what I'm doing right now. (Ideally, not having to leave Word to diff would be the best solution.)

Nakilon
Jungle Hunter

7 Answers

16

There are two different concepts here. One is "can the version control system make some intelligent judgements about the contents of files?", so that it can store just delta information between revisions (and do things like assign responsibility to individual parts of a file).

The other is "do I have a file comparison tool which is useful for the types of files I have in the version control system?". Version control systems tend to come with file comparison tools which are inferior to dedicated alternatives, but they can pretty much always be linked to better diff programs, either for all file types or for specific ones.

So it's common to use, for example, Beyond Compare as a general compare tool, with Word as a dedicated Word document comparer.

Different version control systems differ as to how good people perceive them to be at handling 'binaries', but that's often as much to do with handling huge files and providing exclusive locking as it is to do with file comparison.
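As a rough illustration of linking a VCS to a better diff program for one file type, here is a minimal sketch for git. It assumes git's external diff interface, which calls the configured program with seven arguments (path, old-file, old-hex, old-mode, new-file, new-hex, new-mode). The function name `docx_compare`, the variable `COMPARE_TOOL`, and the driver name `docxcmp` are all hypothetical; substitute whatever comparison program you actually use.

```shell
#!/bin/sh
# Sketch of an external diff wrapper for git (an illustration only).
# Git invokes a diff.<driver>.command program with seven arguments:
#   path old-file old-hex old-mode new-file new-hex new-mode
# COMPARE_TOOL is a hypothetical variable naming the comparison program to run.
docx_compare() {
    old_file=$2
    new_file=$5
    # Git treats a non-zero exit from an external diff as a failure,
    # so the tool's own exit status is deliberately ignored here.
    "$COMPARE_TOOL" "$old_file" "$new_file" || true
}

# When saved as a standalone script, finish with:  docx_compare "$@"
# and register it (the driver name "docxcmp" is arbitrary):
#   git config diff.docxcmp.command /path/to/this/script
#   echo "*.docx diff=docxcmp" >> .gitattributes
```

With that registration in place, `git diff` on a `.docx` file should hand the old and new versions to the chosen tool instead of printing a binary notice.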

Will Dean
  • Yes, I'm aware of the "binary" capability discussions. So what would you suggest - how should one go about recording the history of their DOCXs? – Jungle Hunter Jul 22 '10 at 14:53
  • 3
    I'd put them in any kind of VCS, and configure that VCS to use Word for comparing DOCX files. – Will Dean Jul 22 '10 at 15:40
  • How do I configure TortoiseHg to do this automatically when I double click an old version of the DOCX? – Jungle Hunter Jul 23 '10 at 08:40
  • Ashish - sorry, I don't know, you'll have to look that one up. – Will Dean Jul 23 '10 at 15:02
  • What do you mean with "Word as a dedicated Word document comparer"? Isn't Word only a word processor? – HelloGoodbye Dec 23 '13 at 14:31
  • 1
    @HelloGoodbye Word can compare two Word documents, showing you the differences between them. Whether or not that's part of being 'only a word processor' rather depends on what you consider the features of 'a word processor' to be... – Will Dean Dec 23 '13 at 14:36
9

http://tortoisehg.bitbucket.io/ includes a plugin called docdiff that integrates Word and Excel diff'ing.

Vadim Kotov
Joshua
5

This article outlines a solution for DOCX using Pandoc, while this post outlines a solution for PDF using pdf2html.
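One common way to wire Pandoc into git is a textconv diff driver; the sketch below shows that approach and is not necessarily what the linked article does step by step. It assumes pandoc is installed and on your PATH. The function name `setup_pandoc_diff` and the driver name `pandoc` are arbitrary choices.

```shell
# Sketch: register pandoc as a git textconv filter for .docx files.
# Assumes pandoc is installed; run the function inside a git repository.
setup_pandoc_diff() {
    # Convert each .docx to Markdown before git diffs it.
    git config diff.pandoc.textconv "pandoc --from=docx --to=markdown"
    # Route .docx files in this repository through that driver.
    echo "*.docx diff=pandoc" >> .gitattributes
}
```

After calling `setup_pandoc_diff`, `git diff` on a changed `.docx` should show differences in the converted Markdown text rather than a binary notice.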

Daniil Shevelev
5

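You can use Beyond Compare as an external diff tool for hg. Add the following to your user mercurial.ini (the extdiff extension has to be enabled for the vdiff command to exist):

[extensions]
extdiff =
[extdiff]
cmd.vdiff = c:/path/to/BCompare.exe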

Then get the Beyond Compare file viewer rule for docx.

Now you should be able to compare two versions of a docx file in Beyond Compare.
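As a usage sketch, the `cmd.vdiff` entry above makes an `hg vdiff` command available, which accepts revisions much like `hg diff`. The helper below is only illustrative: the function name `compare_revisions` and its argument order are assumptions, and it relies on the `-r` option of the extdiff extension.

```shell
# Sketch: compare two revisions of one document through the vdiff command.
# The function name and its arguments are hypothetical.
compare_revisions() {
    old_rev=$1
    new_rev=$2
    file=$3
    hg vdiff -r "$old_rev" -r "$new_rev" "$file"
}
```

For example, `compare_revisions 3 5 report.docx` would ask hg to open revisions 3 and 5 of `report.docx` in the configured tool; running plain `hg vdiff report.docx` compares the working copy against its parent revision.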

Geoffrey Zheng
4

Only for docx: I compiled instructions from multiple places here: https://gist.github.com/nachocab/6429893

# download docx2txt by Sandeep Kumar
wget -O docx2txt.pl http://www.cs.indiana.edu/~kinzler/home/binp/docx2txt

# make a wrapper (the downloaded script also needs to be executable)
echo '#!/bin/bash
docx2txt.pl "$1" -' > docx2txt
chmod +x docx2txt docx2txt.pl

# make sure docx2txt.pl and docx2txt are on your PATH. Here's a guide:
# http://shapeshed.com/using_custom_shell_scripts_on_osx_or_linux/
mv docx2txt docx2txt.pl ~/bin/

# set the attribute for this repository (run inside an existing repository; to cover every repository, point git's core.attributesFile setting at a shared attributes file instead)
echo "*.docx diff=word" > .git/info/attributes

# add the following to ~/.gitconfig
[diff "word"]
    binary = true
    textconv = docx2txt

# add a new alias
[alias]
    wdiff = diff --color-words

# try it
git init

# create my_file.docx, add some content

git add my_file.docx

git commit -m "Initial commit"

# change something in my_file.docx

git wdiff my_file.docx

# awesome!

It works great on OS X.

Guildenstern
nachocab
2

If you happen to use a Mac, I wrote a git merge driver that can use Microsoft Word and tracked changes to merge and show conflicts between any file types Word can read & write.

http://github.com/jasmas/wordMerge

I say 'if you happen to use a Mac' because the driver I wrote relies primarily on AppleScript to accomplish this task.

It'd be nice to add a VBScript version to the project, but at the moment I don't have a Windows environment for testing. Anyone with some basic scripting knowledge should be able to take a look at what I'm doing and duplicate it in VBScript, PowerShell or whatever on Windows.

2

I used SVN (yes, in 2020 :-)) with TortoiseSVN on Windows. It has a built-in function to compare DOCX files: it opens Microsoft Word in a mode where the screen is divided into four parts (the file after the changes, the file before the changes, the file with changes highlighted, and a list of changes). Screenshot below (sorry for the Polish version of MS Word). I also checked TortoiseGit and it has this functionality too. I've read that TortoiseHg has it as well.

A screenshot of comparison of changes of a file using Microsoft Word and TortoiseSVN

JustAC0der