Compare PDF Content With Ruby

Question

I am writing a Ruby script/app that helps me compile LaTeX to (at least) PDF. One feature I want it to have is that it should run pdflatex iteratively until the PDF converges (as it should, I guess).

The idea is to compare the PDF generated in one iteration against the one from the former iteration using their fingerprints. In particular, I currently use Digest::MD5.file(.).

The problem is that this never converges. One culprit (the only one, I hope) is the PDF's timestamp, which pdflatex sets with a resolution of at least one second. Since a pdflatex run typically takes longer than one second, the result keeps changing. That is, I expect the PDFs to be equal up to the timestamp(s) after some point. This assumption might be wrong; hints appreciated.

What can I do about this? My basic ideas so far:

Use a library capable of doing the job
Strip meta data away and only hash PDF content
Overwrite timestamps by a fixed value before comparing

Do you have more ideas or even solutions? Solutions should only use free software that runs on Linux. Those that use only Ruby are preferred, but using external software is perfectly acceptable.

By the way, I do not exactly know how PDF is encoded but I suspect that merely comparing the contained text won't work for me since only graphics or links might change in later iterations.

Possibly related:

How to compare two PDF files? (Messy, text-based or proprietary solutions)
Functional PDF Testing (Uses a Java library; not clear whether it is up to the job)

There are pathological cases where it will not converge. This can happen when a reference causes the layout to change thereby shifting the item referenced to a different page so the reference changes and the item referenced changes back so the reference has to change back etc. Granted, such cases are exceptionally rare, but you might want to take them into consideration. It is not difficult to construct such an example for testing. — Ivan Andrus, Feb 19 '12 at 12:40
True. The user is bound to notice this (after, say, 10 iterations) and can for those cases impose an iteration limit. The goal here is to deal with most cases automatically. — Raphael, Feb 19 '12 at 14:10
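The iteration limit discussed in these comments could be sketched as follows. This is only a minimal illustration: `run_until_stable`, its keyword arguments, and the default of 10 runs are hypothetical names and values chosen here, and the caller supplies both the compile step and the fingerprint function as lambdas.

```ruby
# Hypothetical driver: call `compile` until `fingerprint` stops changing,
# giving up after `max_runs` compilations.
# Returns the number of runs needed, or nil if the limit was reached.
def run_until_stable(compile:, fingerprint:, max_runs: 10)
  previous = nil
  max_runs.times do |i|
    compile.call
    current = fingerprint.call
    # Converged: this run produced the same fingerprint as the one before.
    return i + 1 if current == previous
    previous = current
  end
  nil # no convergence within the limit (e.g. an oscillating reference)
end
```

A caller might pass `compile: -> { system("pdflatex", "main.tex") }` together with a fingerprint lambda built on `Digest::MD5.file`, where `main.tex` is a placeholder file name. A `nil` result then signals one of the pathological cases and can be reported to the user.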

score 8 · Accepted Answer · edited May 06 '22 at 14:49

This is probably not the most bullet-proof solution, but it works for me:

grep -av -e '^/CreationDate' -e '^/ModDate' -e '^/ID' file.pdf | md5sum

or from Ruby

`grep -av -e '^/CreationDate' -e '^/ModDate' -e '^/ID' file.pdf | md5sum`.chop!

This computes the PDF's hash after dropping the lines that cause supposedly identical PDFs to differ.
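For a variant that does not shell out, the same filtering idea can be sketched in plain Ruby. The helper name `pdf_fingerprint` and the constant `VOLATILE_PREFIXES` are made up for this illustration, and the prefix list simply mirrors the three patterns from the grep call above, so it carries the same assumption about pdflatex output.

```ruby
require "digest"

# Line prefixes assumed to differ between otherwise identical pdflatex runs
# (the same three patterns as in the grep call; other engines may need more).
VOLATILE_PREFIXES = ["/CreationDate", "/ModDate", "/ID"].map(&:b).freeze

# Hypothetical helper: MD5 over the PDF's bytes with the volatile lines dropped.
def pdf_fingerprint(path)
  digest = Digest::MD5.new
  # Read in binary mode so that compressed streams are not treated as text.
  File.open(path, "rb") do |io|
    io.each_line do |line|
      next if VOLATILE_PREFIXES.any? { |prefix| line.start_with?(prefix) }
      digest << line
    end
  end
  digest.hexdigest
end
```

Two builds can then be compared with `pdf_fingerprint("old.pdf") == pdf_fingerprint("new.pdf")`, where both file names are placeholders.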

YMMV, depending on your PDF creator. To find out what other lines you need to drop, use

diff -a file-1.pdf file-2.pdf | less

edited May 06 '22 at 14:49

ma11hew28

answered Feb 18 '12 at 22:26

Raphael

The above works for `pdflatex` output, but not for e.g. `xelatex`. For some variants in the context of LaTeX, see [the respective engines](https://github.com/akerbos/ltx2any/tree/master/engines) of said script. – Raphael Oct 08 '13 at 23:13
@ma11hew28 Thanks, but I'm gonna roll that back. I think you wanted to post a new answer. – Raphael May 05 '22 at 18:39
OK, thanks for letting me know. Yeah, part of me wanted to post a whole new question and answer, but I didn't want to create duplicate content, and this question and your answer seemed rather close to what I wanted to share. I especially wanted to highlight `diff -a` because after seeing `grep -a` from your answer, I spent six hours writing a shell script that uses `grep -a` to basically do what `diff -a` does. Then, I saw `diff -a` when perusing the output of `man diff`. Then, I saw that you mention `diff -a` later in your answer. Doh! Maybe you'll accept my other edits. – ma11hew28 May 06 '22 at 14:51
@ma11hew28 I see, fair enough. :) Thanks for the small tweaks! – Raphael May 17 '22 at 19:42

score 3 · Answer 2 · answered Jul 09 '19 at 06:48

[Disclaimer: I'm the author of Identikal]

For a project we had a requirement to compare two PDFs in pure Ruby. Ended up writing a gem called identikal. This gem compares two unencrypted PDF files and returns true if they are identical and false otherwise.

Once you install the gem you can compare two PDFs as shown below:

$ identikal file_a.pdf file_b.pdf
true

score 0 · Answer 3 · answered Jan 25 '11 at 19:42

This isn't an answer to your question, but are you familiar with latexmk? It's a Perl script that does exactly what you're after, but achieves it in a very different way. It examines the various .log and .aux files left behind by each tex run and applies heuristics to decide what needs to happen in each case (which may be more complicated than simply re-running tex -- makeindex or xindy may need to be run, as well).

You could either mimic its usage (although with 3546 sloc, I don't particularly recommend it) or simply call it from your Ruby script/app.
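Calling latexmk from Ruby might look like the sketch below. The helper names `latexmk_command` and `run_latexmk` are hypothetical; the only option used is `-pdf`, which asks latexmk to produce a PDF with pdflatex, and latexmk is assumed to be installed and on the PATH.

```ruby
# Hypothetical wrapper: build the latexmk command line for one document.
def latexmk_command(tex_file)
  ["latexmk", "-pdf", tex_file]
end

# Runs latexmk and returns true on success, false on a non-zero exit status,
# or nil if the command could not be executed at all.
def run_latexmk(tex_file)
  # Passing separate arguments avoids shell interpolation of the file name.
  system(*latexmk_command(tex_file))
end
```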

answered Jan 25 '11 at 19:42

mbauman

Thanks. I actually already have a bash script that does very much what I want, but it is not as flexible as I would like, hence the rebuild. In particular, convergence detection is missing. I built it back then because I could not get any other solution I found to work properly, so maybe I tried latexmk back then. For the record, my script will detect what is necessary, too. Will definitely implement bibtex, makeindex and mpost. – Raphael Jan 25 '11 at 20:43

sawa · Answer 4 · 2011-03-06T00:36:59.520

Since a latex run does not have access to its previous runs and depends only (besides system parameters such as the current time) on the text files generated (such as tex, aux, bib, ...), the resulting pdf file converges once all those text files converge (disregarding dependency on system parameters such as time).

In short, you should check the convergence of the text files (tex, aux, bib, ...) rather than the convergence of the pdf file.

1. Make directory A, where you run latex.
2. Make directory B, where you keep a copy of the text files resulting from the previous latex run.
3. Run latex within A.
4. If the contents of all the files in B are the same as the contents of the corresponding files in A, then stop. Otherwise, copy all the text files generated in A (aux, bib, ...) to B, excluding the original tex file if you know that it didn't change. You can also exclude log from the copy list. And then, return to 3.
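The comparison step could also be sketched without a second directory, by keeping digests of the previous run's files in memory instead of copies in B. That is a substitution made for this illustration, not the procedure described above. The helper name `aux_snapshot` and the extension list are assumptions; the log file is deliberately left out.

```ruby
require "digest"

# Hypothetical helper: map each auxiliary file in `dir` to an MD5 digest
# of its contents. The extension list is an assumption and should be adapted
# to the tools in use; .log is excluded on purpose.
def aux_snapshot(dir, extensions: %w[aux toc bbl idx out])
  pattern = File.join(dir, "*.{#{extensions.join(',')}}")
  Dir.glob(pattern).sort.to_h do |path|
    [File.basename(path), Digest::MD5.file(path).hexdigest]
  end
end
```

Taking one snapshot before and one after each latex run and comparing the two hashes with `==` then plays the role of comparing A against B.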

edited Mar 06 '11 at 00:36

answered Mar 05 '11 at 23:42

sawa

Thanks for that take. However, are you certain that the text files actually converge? In particular, does not a single one contain a timestamp? Also, how can I know all created text files without knowing what kind of extra programs are being run? Maybe something outside of the known set of text files has not yet converged? I could, of course, monitor all files but the pdf that way, but then, again, does this set converge? Guess I will have to try. – Raphael Mar 06 '11 at 15:39
Actually, the log file includes a time stamp, so it does not converge, and you should not monitor it. Even if a program outside of latex runs, the only way it interacts with latex is by leaving a file and letting (La)TeX read it, so if you check the original directory, you will know – sawa Mar 07 '11 at 00:36
all the relevant files. If it happens that files other than log include a time stamp, then you need to use a regexp to exclude that when comparing the files. I think that's still easier than dealing with binary files (but am not sure how you feel). Another situation in which the text files do not converge is when you, for example, refer to a time stamp. In that case, the value of the time will appear and change in aux, but I guess you do not have that usage, do you? – sawa Mar 07 '11 at 00:48
I try to write a general tool, so I want to make as few assumptions (e.g. what contains timestamps in which format) as possible. I think hashing the (binary) data part of the PDF is the easiest solution, provided I can extract that part, i.e. cut the header off a PDF. Therefore the question. – Raphael Mar 07 '11 at 13:47
