Assert content of PDF created by wkhtmltopdf (non-deterministic behaviour with multiple fonts)

Question

I have some PHP code that generates HTML and converts it to a PDF using wkhtmltopdf (version 0.12.5).
I have a PHP test case that asserts the PDF is created, and I also want to assert the content of the PDF.

I already remove the CreationDate field from the PDFs before comparing them, as suggested in "wkhtmltopdf generates a different checksum on every run". However, wkhtmltopdf still produces different results for the same input if the document uses multiple fonts.

Here is a small example HTML file that reproduces the problem:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
    <head>
        <meta charset="utf-8">
    </head>
    <body>
        <div>
            <strong>Hello</strong>
            <span style="text-decoration: underline;">World</span>
        </div>
    </body>
</html>

And a few shell commands to illustrate the problem:

wkhtmltopdf minimal_example.html t1.pdf
sleep 2
wkhtmltopdf minimal_example.html t2.pdf
sed 's#/CreationDate (D:[^)]*)##' t1.pdf > t1_stripped.pdf
sed 's#/CreationDate (D:[^)]*)##' t2.pdf > t2_stripped.pdf
sha256sum t1_stripped.pdf
sha256sum t2_stripped.pdf

Most of the time, these commands output two different checksums, although the PDF files look identical. I opened the PDFs in a text editor (as UTF-8), and it seems to me that the order in which the different fonts are defined (bold and underline in this example) is random.
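
One way to check the font-order hypothesis without reading the files by hand is to print the font names of each PDF in file order and compare the two listings. The following is only a minimal sketch: the helper name `list_fonts` is hypothetical, and it assumes the font dictionaries are stored uncompressed and carry a `/BaseFont` entry, which may not hold for every wkhtmltopdf build.

```shell
#!/bin/sh
# Hypothetical helper: print the /BaseFont entries of a PDF in file order.
# Assumption: font dictionaries are not hidden inside compressed streams.
list_fonts() {
    # -a treats the binary PDF as text, -o prints only the matching part.
    LC_ALL=C grep -a -o '/BaseFont */[A-Za-z0-9+_.,-]*' "$1"
}
```

With the files from the commands above, `list_fonts t1.pdf` and `list_fonts t2.pdf` can be redirected to two text files and compared with `diff`. If the listings differ but their sorted versions match, the two PDFs contain the same fonts in a different order. If the sorted listings differ as well, the font names themselves vary between runs and this check is inconclusive.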

So, now the question is: What is the easiest way to assert that two PDFs are equal? I would prefer to assert as much of the PDF as possible, preferably without new dependencies.

1. Is there a way to remove the randomness from wkhtmltopdf so I can assert the whole PDF?
2. If not, what would be the best way to assert the PDF content? Comparing them as images as described here would be possible, though I would like to avoid bringing in two new dependencies (Imagick and GhostScript) just for this.
3. Are there any other possibilities? Converting the PDF to text will not be sufficient, as I want to assert the different fonts as well.

Possible duplicate of [Comparison of two pdf files](https://stackoverflow.com/questions/6704594/comparison-of-two-pdf-files) — Nico Haase, Apr 15 '19 at 16:14
The linked question is about finding the differences in two PDFs. While one of the answers (https://stackoverflow.com/a/34177834/10380981) could solve my problem, the main question here is whether there is a solution that works without new dependencies. I edited the second question, but I was hoping that there would be another solution. — Arno Hilke, Apr 15 '19 at 16:42
