Comparing a signed PDF to an unsigned PDF using document hash

Question

After extensive Google searches, I'm starting to wonder if I'm missing the point of digital signatures in some way.

This is fundamentally what I believe I should be able to do in principle, and I'm hoping iTextSharp will allow me:

I'm writing in C# and .NET and using iTextSharp to parse PDF files. I have an unsigned PDF file, and also a signed version of the same file.

I'm aware a digital signature fundamentally hashes the PDF data, encrypts it with a private key, and then part of the verification process is to decrypt this using the public key and ensure the result matches the PDF data when hashed again.

In addition to this, I want to get this decrypted document hash and compare it to a document hash generated from my unsigned PDF. This is because I not only want to verify that the signed PDF is authentic, but also that it's the same unsigned PDF I have on record. I suppose I could also do this by comparing the PDF data (without the signature) with my PDF data on record.

I currently haven't worked out how to do any of this! i.e.:

1. How do I extract PDF data from a signed PDF excluding the signature?
2. Alternatively, how do I generate a hash from an unsigned PDF?
3. Along with 2., how do I extract a decrypted hash from a PDF signature?

Hope this is clear, and I haven't missed the point somewhere!

@Lie Ryan, maybe you can base your solution on this project: http://portablesigner.sourceforge.net/. – detunized, Oct 16 '12 at 11:57

score 8 · Answer 1 · edited May 23 '17 at 11:43

About this:

"This is because I not only want to verify that the signed PDF is authentic, but also that it's the same unsigned PDF I have on record"

Assuming you just want to know that a document you get on your server is authentic:

When creating a signed document, you have the choice of signing only one part of the file, or the entire document. You can then use a "whole document" signature, and if the document you get back on your server is "authentic" (which means that the verification of the signature succeeded), then it is for sure the same document you have on record.

It's worth mentioning that there are two types of PDF signatures, approval signatures and certification signatures. From the document Digital Signatures in PDF from Adobe:

(...) approval signatures, where someone signs a document to show consent, approval, or acceptance. A certified document is one that has a certification signature applied by the originator when the document is ready for use. The originator specifies what changes are allowed; choosing one of three levels of modification permitted:

no changes

form fill-in only

form fill-in and commenting

Assuming you want to match a certain signed document that you got on your server with its unsigned equivalent in a database:

For document identification, I would suggest dealing with it separately. Once a document can be opened, a hash (MD5, for example) can be created from the concatenation of the decompressed content of all its pages and then compared to the same kind of hash from the original document (which can be generated once and stored in a database).

The reason I would do it this way is that it is independent of the type of signature that was used on the document. Even when form fields are edited in a PDF file, or annotations are added, or new signatures are created, the page content is never modified; it always remains the same.

If you are using iText, you can get a byte array of the page content by using the method PdfReader.getPageContent and use the result for computing an MD5 hash.

The code in Java might look like this:

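import java.security.MessageDigest;
import com.itextpdf.text.pdf.PdfReader; // iText 5; older releases use com.lowagie.text.pdf

PdfReader reader = new PdfReader("myfile.pdf");
MessageDigest messageDigest = MessageDigest.getInstance("MD5");
int pageCount = reader.getNumberOfPages();
// Pages are numbered from 1; feed each page's content into the digest in order.
for (int i = 1; i <= pageCount; i++)
{
     byte[] buf = reader.getPageContent(i);
     messageDigest.update(buf, 0, buf.length);
}
reader.close();
byte[] hash = messageDigest.digest();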
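
To make the "generate once, store, compare later" step concrete, here is a minimal sketch that uses only the JDK. It assumes the per-page byte arrays were obtained with PdfReader.getPageContent as above. The names PageContentHash, digestPages and sameDocument are made up for illustration, and SHA-256 is used in place of MD5 purely as an assumption; either would do for matching documents.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.List;

final class PageContentHash {

    // Hash the page contents in page order. Each element is assumed to be
    // the byte[] that PdfReader.getPageContent(i) returned for page i.
    static byte[] digestPages(List<byte[]> pages) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            for (byte[] page : pages) {
                md.update(page);
            }
            return md.digest();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    // Compare the hash kept on record with the hash of the uploaded file.
    static boolean sameDocument(byte[] storedHash, byte[] uploadedHash) {
        return MessageDigest.isEqual(storedHash, uploadedHash);
    }

    private static byte[] ascii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // Stand-in page contents; real ones would come from iText.
        List<byte[]> onRecord = Arrays.asList(ascii("BT (Page one) Tj ET"), ascii("BT (Page two) Tj ET"));
        List<byte[]> signedUpload = Arrays.asList(ascii("BT (Page one) Tj ET"), ascii("BT (Page two) Tj ET"));
        List<byte[]> wrongUpload = Arrays.asList(ascii("BT (Page one) Tj ET"), ascii("BT (Other doc) Tj ET"));

        byte[] stored = digestPages(onRecord); // computed once, kept in the database
        System.out.println(sameDocument(stored, digestPages(signedUpload))); // true
        System.out.println(sameDocument(stored, digestPages(wrongUpload)));  // false
    }
}
```

The stored value only needs to be computed once per outgoing document; each upload is then hashed the same way and compared against it.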

Additionally, if the server receives a file that went out unsigned and came back signed, the signature may refer to just one part of the file and not all of it. In this scenario, the signature digests might not be enough to identify the file.

From the PDF specification (sections in bold on my account):

Signatures are created by computing a digest of the data (or part of the data) in a document, and storing the digest in the document.(...) There are two defined techniques for computing a reproducible digest of the contents of all or part of a PDF file:

-A byte range digest is computed over a range of bytes in the file, indicated by the ByteRange entry in the signature dictionary. This range is typically the entire file, including the signature dictionary but excluding the signature value itself (the Contents entry).

-An object digest (PDF 1.5) is computed by selectively walking a subtree of objects in memory, beginning with the referenced object, which is typically the root object. The resulting digest, along with information about how it was computed, is placed in a signature reference dictionary (...).
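
As a rough illustration of the byte range technique quoted above, the sketch below hashes only the byte ranges listed in a ByteRange-style array of (offset, length) pairs and skips the gap where the signature value would sit. It is JDK-only and runs on a toy byte array, not a real PDF. The class name ByteRangeDigest and the sample data are hypothetical, and SHA-256 is an assumption; a real signature names its own digest algorithm.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class ByteRangeDigest {

    // byteRange holds (offset, length) pairs, e.g. {0, 6, 13, 7}.
    // Only the bytes inside those ranges are fed into the digest.
    static byte[] digest(byte[] file, int[] byteRange) {
        if (byteRange.length % 2 != 0) {
            throw new IllegalArgumentException("byteRange must hold offset/length pairs");
        }
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            for (int i = 0; i < byteRange.length; i += 2) {
                md.update(file, byteRange[i], byteRange[i + 1]);
            }
            return md.digest();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    private static byte[] ascii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // Toy "files": 6 bytes of header, a 7-byte signature value, 7 bytes of trailer.
        byte[] fileA = ascii("HEADER<SIG-A>TRAILER");
        byte[] fileB = ascii("HEADER<SIG-B>TRAILER"); // only the signature value differs
        byte[] fileC = ascii("HEADER<SIG-A>TRAILEX"); // covered bytes differ
        int[] range = {0, 6, 13, 7};                  // everything except bytes 6..12

        byte[] a = digest(fileA, range);
        System.out.println(MessageDigest.isEqual(a, digest(fileB, range))); // true
        System.out.println(MessageDigest.isEqual(a, digest(fileC, range))); // false
    }
}
```

Because the covered range typically includes the signature dictionary that signing adds to the file, a digest computed this way over the signed file should not be expected to equal a plain hash of the unsigned file, which is one reason to identify documents by page content instead.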

`Why would you want to do that?`, verifying that the document is actually generated by the server is not really the point. In my case, the user may be downloading multiple unsigned documents, and then they have to put an approval signature on these documents (or not, if a document is rejected), then they have to upload the signed documents to the right places. I want to be able to check whether the user has made an error and swapped the signed documents (i.e. uploaded the wrong document to the wrong place). – Lie Ryan, Oct 18 '12 at 16:25
@Lie Ryan Yes, I understand, I was answering the original question in that case, not yours. In your case, please see the "document identification" part of my answer. – yms, Oct 18 '12 at 16:44

score 5 · Answer 2 · Kevin Stricker · edited Oct 18 '12 at 16:51

A strategy for verifying the integrity of a signed PDF:

1. Don't send out an unsigned PDF in the first place. Using iText (Java version for Linux-friendly applications), sign and certify the document using CERTIFIED_FORM_FILLING.
2. Get the end-user to add their signature to a form field and send it back. This can be done because changes to the form won't break the document certification.
3. Validate both signatures and the document certification.

You should be able to figure out how to do all of this from the iText documentation: http://itextpdf.sourceforge.net/howtosign.html

All you would need to do to verify that a certified document is the same as an original would be to compare the document metadata to the original. The title comes to mind as a potentially good candidate.

To get the title from a PDF for comparison using iTextSharp (C#), you would just use this code:

PdfReader reader = new PdfReader("AsignedPDF.pdf");
string s = reader.Info["Title"];

edited Oct 18 '12 at 16:51

answered Oct 16 '12 at 15:20

This does not solve the problem at all; there are multiple documents dynamically generated by the server. It is not enough to check that the document was generated by the server and was not tampered with in any other way; I need to check that the document is a signed version of a specific document. – Lie Ryan Oct 18 '12 at 16:16
All you need to do is add *anything* (other than a form field) that uniquely identifies the document type to the document and check for the presence of that. The user can't change that unique identifier without invalidating the certification. – Kevin Stricker Oct 18 '12 at 16:31
I see. They want to see if it's the same document, but different hash so they know it's been changed. – Lee Louviere Oct 18 '12 at 16:48
