I'm trying to verify that the content generated by wkhtmltopdf is the same from run to run. However, every time I run wkhtmltopdf I get a different hash / checksum value for the same page, even for something as basic as this HTML page:

<html>
<body>
<p> This is some text</p>
</body>
</html>

I get a different md5 or sha256 hash every time I run wkhtmltopdf with this command line:

./wkhtmltopdf example.html ~/Documents/a.pdf

I compute the hash with this Python function:

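import hashlib

def shasum(filename):
    """Return the SHA-256 hex digest of the file's contents."""
    sha = hashlib.sha256()
    with open(filename, 'rb') as f:
        # Read in multiples of the hash block size to bound memory use.
        for chunk in iter(lambda: f.read(128 * sha.block_size), b''):
            sha.update(chunk)
    return sha.hexdigest()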

or the md5 version, which just swaps sha256 for md5.

Why would wkhtmltopdf generate output that differs enough to change the checksum, and is there any way to prevent that? Is there a command-line option I can pass?

I've tried --default-header, --no-pdf-compression and --disable-smart-shrinking.

This is on Mac OS X, but I've also generated these PDFs on other machines and downloaded them, with the same result.

wkhtmltopdf version = 0.10.0 rc2

4 Answers

I tried this and opened the resulting PDF in emacs. wkhtmltopdf is embedding a "/CreationDate" field in the PDF. It will be different for every run, and will screw up the hash values between runs.

I didn't see an option to disable the "/CreationDate" field, but it would be simple to strip it out of the file before computing the hash.
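As a rough illustration of the stripping idea, here is a minimal Python sketch that removes the "/CreationDate" entry before hashing. The names `strip_creation_date` and `shasum_without_creation_date` are made up for this example, and it assumes the date appears uncompressed in the form `/CreationDate (D:...)` and is the only thing that changes between runs. It also reads the whole file into memory, which should be fine for small PDFs.

```python
import hashlib
import re

# Matches an info-dictionary entry such as /CreationDate (D:20160429201200)
# Assumption: the entry is stored as an uncompressed literal string.
CREATION_DATE = re.compile(rb"/CreationDate\s*\(D:[^)]*\)")

def strip_creation_date(data: bytes) -> bytes:
    """Return the PDF bytes with any /CreationDate entry removed."""
    return CREATION_DATE.sub(b"", data)

def shasum_without_creation_date(filename: str) -> str:
    """SHA-256 hex digest of a file, ignoring the /CreationDate entry."""
    with open(filename, "rb") as f:
        data = f.read()
    return hashlib.sha256(strip_creation_date(data)).hexdigest()
```

Calling `shasum_without_creation_date` on two PDFs generated from the same page should then give the same digest, provided nothing else in the files varies.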

Jeremiah
You can use sed, as in this example, which works on OS X: `LANG=C sed 's/\/CreationDate (D:[^)]*)//' < myfile.pdf > strippedfile.pdf` – mwag Apr 29 '16 at 20:12

I wrote a method to copy the creation date from the expected output into the newly generated file. It's in Ruby, and the arguments can be any objects that walk and quack like IO:

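def copy_wkhtmltopdf_creation_date(to, from)
  # Remember the current positions so they can be restored afterwards.
  to_current_pos, from_current_pos = [to.pos, from.pos]
  # In my files the 14-digit creation timestamp starts at byte offset 74.
  to.pos = from.pos = 74
  to.write(from.read(14))
  to.pos, from.pos = [to_current_pos, from_current_pos]
end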
Kadu Diógenes

I was inspired by Carlos to write a solution that doesn't use a hardcoded index, since in my documents the index differed from Carlos' 74.

Also, I don't have the files open already, so this version takes file paths. It also handles the case where no CreationDate is found by leaving the target file untouched.

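def copy_wkhtmltopdf_creation_date(to, from)
  # Walk the source file line by line, accumulating the byte offset until
  # the line containing CreationDate is found.
  index, date = File.foreach(from).reduce(0) do |acc, line|
    if line.index("CreationDate")
      # Offset of the 14-digit timestamp, plus the timestamp itself.
      break [acc + line.index(/\d{14}/), $~[0]]
    else
      acc + line.bytesize
    end
  end

  if date # i.e., a CreationDate was found, so this is a wkhtmltopdf document
    File.open(to, "r+") do |f|
      f.pos = index
      f.write(date)
    end
  end
end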
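
For readers using the Python hasher from the question, the same copy-the-date idea could look roughly like the sketch below. It is only an illustration: `copy_creation_date` is a hypothetical name, it works on bytes already read into memory, and it assumes the timestamp is 14 digits immediately after `/CreationDate (D:`. Any timezone suffix after those digits is left as it is.

```python
import re

# Captures the 14-digit timestamp (YYYYMMDDHHMMSS) inside /CreationDate (D:...)
DATE_RE = re.compile(rb"/CreationDate\s*\(D:(\d{14})")

def copy_creation_date(to_data: bytes, from_data: bytes) -> bytes:
    """Return to_data with its CreationDate digits replaced by from_data's.

    If either input has no CreationDate entry, to_data is returned unchanged.
    """
    src = DATE_RE.search(from_data)
    dst = DATE_RE.search(to_data)
    if src is None or dst is None:
        return to_data
    start, end = dst.span(1)  # byte range of the 14 digits in to_data
    return to_data[:start] + src.group(1) + to_data[end:]
```

Because the 14 digits are overwritten with 14 other digits, the file length and every byte offset after the date stay the same, which is also why the Ruby versions can write in place.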
Gabe Kopley

We solved the problem by stripping the creation date with a simple regex.

$file_contents = preg_replace("/\\/CreationDate \\(D:.*\\)\\n/uim", "", $file_contents, 1);

After doing this we can get a consistent checksum every time.

JordanC