I have some PHP code that generates HTML and converts it to a PDF using wkhtmltopdf (version 0.12.5).
I have a PHP test case that asserts the PDF is created, and I would also like to assert its content.
I already remove the CreationDate field from the PDF before comparing, as suggested in "wkhtmltopdf generates a different checksum on every run". However, wkhtmltopdf still produces different output for the same input when the document uses multiple fonts.
Here is a small example HTML file that reproduces the problem:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
<meta charset="utf-8">
</head>
<body>
<div>
<strong>Hello</strong>
<span style="text-decoration: underline;">World</span>
</div>
</body>
</html>
And a few shell commands to illustrate the problem:
wkhtmltopdf minimal_example.html t1.pdf
sleep 2
wkhtmltopdf minimal_example.html t2.pdf
sed 's#/CreationDate (D:[^)]*)##' t1.pdf > t1_stripped.pdf
sed 's#/CreationDate (D:[^)]*)##' t2.pdf > t2_stripped.pdf
sha256sum t1_stripped.pdf
sha256sum t2_stripped.pdf
Most of the time, this outputs two different checksums, although the PDF files look identical. I opened the PDFs as text (UTF-8), and it seems to me that the order in which the different fonts are defined (bold and underline in this example) is random.
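For repeated runs, the strip-and-hash steps above can be wrapped in small helpers. This is only a sketch: the function names `pdf_checksum` and `pdf_font_order` are made up for illustration, and the font listing assumes the `/FontName` entries are stored outside compressed streams, which I have not verified for every wkhtmltopdf build.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative and not part of wkhtmltopdf.

# Print the SHA-256 of a PDF with the /CreationDate entry removed.
# LC_ALL=C makes sed treat the file as raw bytes instead of multibyte text.
pdf_checksum() {
    LC_ALL=C sed 's#/CreationDate (D:[^)]*)##' "$1" | sha256sum | cut -d' ' -f1
}

# List the /FontName entries in the order they appear in the file.
# Assumption: the entries are not inside compressed streams.
pdf_font_order() {
    LC_ALL=C grep -a -o '/FontName */[^ /<>]*' "$1"
}
```

Calling `pdf_checksum t1.pdf` and `pdf_checksum t2.pdf` gives the two hashes to compare, and comparing the output of `pdf_font_order` for both files should show whether the font definitions really appear in a different order.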
So, now the question is: What is the easiest way to assert that two PDFs are equal? I would prefer to assert as much of the PDF as possible, preferably without new dependencies.
- Is there a way to remove the randomness from wkhtmltopdf so I can assert the whole PDF?
- If not, what would be the best way to assert the PDF content? Comparing them as images as described here would be possible, though I would like to avoid bringing in two new dependencies (Imagick and GhostScript) just for this.
- Are there any other possibilities? Converting the PDF to text will not be sufficient, as I want to assert the different fonts as well.