Document validation / comparison

Question

Is there a way to compare two docx documents?

I have one that is generated from a template document where some sections are removed dynamically through bookmarks and block sections from the template.

I would like to compare the generated document with another docx which would be the expected result.

I have vaguely heard of checksum comparison.

Does anybody have pointers on the best way to compare two documents?

Thanks

possible duplicate of [Comparing two RTF documents side-by-side in Word (VSTO)](http://stackoverflow.com/questions/4962280/comparing-two-rtf-documents-side-by-side-in-word-vsto) — , Jun 21 '11 at 17:06
docx4j contains code for xml differencing; I have used that from C# via IKVM — JasonPlutext, Jun 22 '11 at 01:13
Is this for the purpose of unit testing say, or for displaying differences to the user? — JasonPlutext, Jun 22 '11 at 01:33

score 1 · Accepted Answer · edited May 23 '17 at 12:04


You could use XMLUnit for .NET to compare the main document parts (document.xml).

You could get the main document parts using the OpenXML SDK, or System.IO.Packaging. See "C# to replace strings of text in a docx" for more on the latter approach.
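A minimal sketch of that idea, under some loud assumptions: it reads the package with `System.IO.Compression.ZipArchive` rather than the OpenXML SDK or System.IO.Packaging, it assumes the main part lives at `word/document.xml` (typical, but the package relationships are the authoritative source), and it swaps XMLUnit for the built-in `XNode.DeepEquals`, which only answers equal or not equal and does not list differences. The class and method names are hypothetical.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Xml.Linq;

public static class DocxXmlComparer
{
    // Assumption: the main document part is stored at this path.
    private const string MainPartPath = "word/document.xml";

    // Loads the main document part of a .docx (a ZIP package) as XML.
    public static XDocument LoadMainPart(Stream docx)
    {
        using (var zip = new ZipArchive(docx, ZipArchiveMode.Read, leaveOpen: true))
        {
            ZipArchiveEntry entry = zip.GetEntry(MainPartPath);
            if (entry == null)
                throw new InvalidDataException("No " + MainPartPath + " in package.");

            using (Stream part = entry.Open())
                return XDocument.Load(part);
        }
    }

    // True when both main parts parse to structurally equal XML trees.
    public static bool MainPartsEqual(Stream expected, Stream actual)
    {
        return XNode.DeepEquals(LoadMainPart(expected), LoadMainPart(actual));
    }
}
```

For a unit test, something like `DocxXmlComparer.MainPartsEqual(File.OpenRead("expected.docx"), File.OpenRead("generated.docx"))` would be the entry point (both file names are placeholders). If you need a readable list of differences rather than a yes/no answer, a diffing library such as XMLUnit is probably the better fit.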

answered Jun 22 '11 at 22:10

JasonPlutext

score 0 · Answer 2 · edited May 23 '17 at 11:55

> I vaguely heard of checksum comparison.

Checksums work well for checking byte-for-byte equality. If that is what you are looking for, read each document's bytes as a stream and use SHA256Managed or MD5CryptoServiceProvider to generate a checksum for each file. If the two checksums match, the two documents are almost certainly identical.
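As a rough sketch of the whole-file checksum approach: the helper below hashes a stream with SHA-256 and compares the hex digests of two files. It uses `SHA256.Create()` instead of instantiating `SHA256Managed` directly, which is my substitution, and the class and method names are hypothetical.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

public static class FileChecksum
{
    // Returns the SHA-256 digest of the stream as lowercase hex.
    public static string Sha256Hex(Stream data)
    {
        using (SHA256 sha = SHA256.Create())
        {
            byte[] hash = sha.ComputeHash(data);
            return BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
        }
    }

    // True when both files hash to the same digest, i.e. their bytes match.
    public static bool SameBytes(string pathA, string pathB)
    {
        using (FileStream a = File.OpenRead(pathA))
        using (FileStream b = File.OpenRead(pathB))
            return Sha256Hex(a) == Sha256Hex(b);
    }
}
```

One caveat worth keeping in mind: a .docx is a ZIP package, so two documents with the same visible content can still differ at the byte level (for example through package metadata), in which case this check reports a mismatch.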

MD5 is not suitable for security purposes (http://en.wikipedia.org/wiki/MD5 - see "Security"), but it should be fine for comparison purposes where you are in control of both documents. Also keep in mind that checksums are not guaranteed to be unique, so there is always a remote possibility of a collision.

> I have one that is generated from a template document where some sections are removed dynamically through bookmarks and block sections from the template.

However, if you are comparing section by section, you will need to open the document as more than raw bytes and deal with it in a structured fashion. You can open a .docx file programmatically from C# by a variety of means; perhaps you could then compute a checksum over the contents of each section?
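One way this could look, as a sketch only: assuming you have already loaded the main document part (`word/document.xml`) into an `XDocument`, hash each direct child of `w:body` and report the first block whose hash differs. Treating each body child (paragraph, table, section properties) as a "section" is my assumption, not something the question defines, and the class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;
using System.Xml.Linq;

public static class BlockChecksums
{
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // One SHA-256 hash (Base64) per direct child element of w:body.
    public static List<string> HashBodyBlocks(XDocument mainPart)
    {
        XElement body = mainPart.Root.Element(W + "body");
        if (body == null)
            throw new ArgumentException("No w:body element found.");

        using (SHA256 sha = SHA256.Create())
        {
            // ToList() forces evaluation before the hasher is disposed.
            return body.Elements()
                .Select(e => Encoding.UTF8.GetBytes(e.ToString(SaveOptions.DisableFormatting)))
                .Select(bytes => Convert.ToBase64String(sha.ComputeHash(bytes)))
                .ToList();
        }
    }

    // Index of the first differing block, or -1 when both lists match.
    public static int FirstDifference(IList<string> expected, IList<string> actual)
    {
        int common = Math.Min(expected.Count, actual.Count);
        for (int i = 0; i < common; i++)
            if (expected[i] != actual[i]) return i;
        return expected.Count == actual.Count ? -1 : common;
    }
}
```

Because the hash covers the raw XML of each block, attributes that do not affect the visible content (Word's `rsid` revision identifiers, for instance) would still register as differences; you may want to strip those before hashing.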

This thread talks about creating and manipulating .docx files using C#: "How can a Word document be created in C#?". The same tools could be used to read one.
