
I have some PHP code that generates HTML and converts it to a PDF using wkhtmltopdf (version 0.12.5).
I also have a PHP test case that asserts the PDF is created, but I also want to assert the content of the PDF.

I already remove the CreationDate field from the PDFs before comparing them, as suggested in "wkhtmltopdf generates a different checksum on every run". However, wkhtmltopdf still produces different output for the same input when the document uses multiple fonts.

Here is a small example HTML file that reproduces the problem:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
    <head>
        <meta charset="utf-8">
    </head>
    <body>
        <div>
            <strong>Hello</strong>
            <span style="text-decoration: underline;">World</span>
        </div>
    </body>
</html>

And a few shell commands to illustrate the problem:

wkhtmltopdf minimal_example.html t1.pdf
sleep 2
wkhtmltopdf minimal_example.html t2.pdf
sed 's#/CreationDate (D:[^)]*)##' t1.pdf > t1_stripped.pdf
sed 's#/CreationDate (D:[^)]*)##' t2.pdf > t2_stripped.pdf
sha256sum t1_stripped.pdf
sha256sum t2_stripped.pdf 

Most of the time, this prints two different checksums, although the PDF files look identical. I opened the PDFs as UTF-8 text, and it seems to me that the order in which the different fonts are defined (bold and underline in this example) is random.
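One way to see where the two stripped files actually differ, using only standard tools, is to replace every non-printable byte and diff the result. The following is a minimal sketch, not a definitive tool: `pdf_text_diff` is a hypothetical helper name, and it assumes the object dictionaries in the PDF are stored as plain text, as they appeared to be when I opened the files.

```shell
# Hypothetical helper: diff two PDFs as text, with every non-printable
# byte replaced by a dot so that diff treats both files as plain text.
pdf_text_diff() {
    ptd_a=$(mktemp) || return 2
    ptd_b=$(mktemp) || { rm -f "$ptd_a"; return 2; }
    # Keep printable characters and newlines, turn everything else into '.'
    LC_ALL=C tr -c '[:print:]\n' '.' < "$1" > "$ptd_a"
    LC_ALL=C tr -c '[:print:]\n' '.' < "$2" > "$ptd_b"
    # Remember diff's exit status (0 = no differences, 1 = differences)
    ptd_status=0
    diff "$ptd_a" "$ptd_b" || ptd_status=$?
    rm -f "$ptd_a" "$ptd_b"
    return "$ptd_status"
}
```

Calling `pdf_text_diff t1_stripped.pdf t2_stripped.pdf` prints the differing lines and returns 1, or returns 0 when no differences remain after the replacement. Because the replacement is lossy, this is only meant for inspection, not as the assertion itself.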

So, now the question is: What is the easiest way to assert that two PDFs are equal? I would prefer to assert as much of the PDF as possible, preferably without new dependencies.

  1. Is there a way to remove the randomness from wkhtmltopdf so I can assert the whole PDF?
  2. If not, what would be the best way to assert the PDF content? Comparing them as images as described here would be possible, though I would like to avoid bringing in two new dependencies (Imagick and Ghostscript) just for this.
  3. Are there any other possibilities? Converting the PDF to text will not be sufficient, as I want to assert the different fonts as well.
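
Regarding the third question, a partial, order-independent check that needs no new dependencies might be to compare the list of embedded font names instead of the whole file. This is only a sketch under several assumptions: `pdf_font_names` is a hypothetical helper name, the font dictionaries are readable as plain text and carry a `/BaseFont` entry, subset fonts are prefixed with a six-uppercase-letter tag followed by `+` (as the PDF specification describes), and `grep` supports the `-a` and `-o` options (GNU grep does). It asserts which fonts are present, not where they are used.

```shell
# Hypothetical helper: print the base font names found in a PDF, one per
# line and sorted, so the result does not depend on the order in which
# the font objects were written.
# Assumes uncompressed font dictionaries that use the /BaseFont key.
pdf_font_names() {
    LC_ALL=C grep -a -o '/BaseFont */[A-Za-z0-9+_,.-]*' "$1" |
        # Drop the key itself, then an optional subset tag such as "ABCDEF+"
        LC_ALL=C sed -e 's#^/BaseFont */##' -e 's#^[A-Z]\{6\}+##' |
        LC_ALL=C sort
}
```

A test could then compare `"$(pdf_font_names t1.pdf)"` with `"$(pdf_font_names t2.pdf)"`, or with a fixed expected list.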
Arno Hilke
  • Possible duplicate of [Comparison of two pdf files](https://stackoverflow.com/questions/6704594/comparison-of-two-pdf-files) – Nico Haase Apr 15 '19 at 16:14
  • The linked question is about finding the differences in two PDFs. While one of the answers (https://stackoverflow.com/a/34177834/10380981) could solve my problem, the main question here is whether there is a solution that works without new dependencies. I edited the second question, but I was hoping that there would be another solution. – Arno Hilke Apr 15 '19 at 16:42

0 Answers