In 2022, the answers based on applying compare directly to PDF files are not working for me. It seems that this command no longer handles PDFs properly. However, compare does work when applied to PNG files.
I have adapted bits and pieces from the previous answers to write two scripts that do slightly different things: ComparePdfs.sh and ComparePdfs2.sh, both to be executed on the command line. Both scripts are listed at the end of this answer.
Some caveats
These two scripts compare the two PDF files page by page, and each pair of pages is compared purely visually (since the pages are converted to PNG). So the scripts are sensitive only to flat text and flat graphics. If the only difference between the two PDF files concerns some other kind of PDF content, such as logical structuring elements, annotations, form fields, layers, videos, or 3D objects (U3D or PRC), both scripts will nevertheless report that the two PDFs are the same.
I haven't tested the scripts on PDFs that differ only in this kind of 'extra' content.
How to tell if two files (PDF or not) have completely identical content
The only other kind of comparison I know how to do is the one that lets us know if the contents of the two PDF files are completely identical in every respect, including the various embedded metadata, such as the creation date, the document’s title (which has nothing to do with any title displayed on the first page), the program used to create the PDF, and so on.
It's the same method that can be used to check if any two files (PDF or not) are bit-by-bit identical.
To do this, all you have to do is compute and compare checksums for the two files. I'm including a script for that as well, called AreIdentical.sh. It is listed at the very end of this answer. Here is how to use it.
Suppose the two files are named "my_first_PDF_file.pdf" and "another_PDF_file.pdf". Then, once you execute the following on the command line, the output text will read "same" or "different" depending on whether the two files are the same or different.
AreIdentical.sh my_first_PDF_file.pdf another_PDF_file.pdf
Note that information such as the file's name is not considered when the checksums are computed. The reason is that the name of the file is stored not within the file itself but in the directory entry of the file. So two files may be found to be identical even if their file names are different; see this question. Similarly, the creation date as returned by ls -l (as opposed to the one that's in the PDF's embedded metadata) is also not considered when checksums are computed, for the same reason.
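As a sketch of the same idea without checksums, the standard cmp utility compares two files byte by byte. The snippet below also shows that a copy with a different name and a different modification time is still reported as identical. The function name are_identical and the temporary file names are made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: prints "same" if the two files are byte-for-byte
# identical, "different" otherwise. cmp -s prints nothing and reports the
# result through its exit status.
are_identical() {
    if cmp -s "$1" "$2"; then echo "same"; else echo "different"; fi
}

# Demonstration with throwaway files (the names are arbitrary).
tmpdir=$(mktemp -d)
printf 'some content\n' > "$tmpdir/original.bin"
cp "$tmpdir/original.bin" "$tmpdir/renamed_copy.bin"
# Give the copy a different modification time; its bytes stay the same.
touch -t 200101010000 "$tmpdir/renamed_copy.bin"
result_copy=$(are_identical "$tmpdir/original.bin" "$tmpdir/renamed_copy.bin")

# Appending a single byte makes the two files differ.
printf 'x' >> "$tmpdir/renamed_copy.bin"
result_changed=$(are_identical "$tmpdir/original.bin" "$tmpdir/renamed_copy.bin")

echo "$result_copy $result_changed"   # prints: same different
rm -rf "$tmpdir"
```

Because cmp reads the bytes directly, it can stop at the first difference, whereas a checksum has to read both files completely.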
How to use the scripts ComparePdfs.sh and ComparePdfs2.sh
We assume that the two pdf files to be (purely visually) compared, file1.pdf and file2.pdf, are in the working directory.
As an example, assume that they both have 4 pages, and that all pages are identical except page 3.
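The example assumes that both files have the same number of pages, and the scripts loop over the pages of the first file only. Here is a minimal sketch of a pre-check, assuming pdfinfo from poppler-utils (the package that also provides pdftoppm) is installed. The function names are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: extracts the page count from pdfinfo output read on
# stdin. pdfinfo prints a line of the form "Pages:          4".
pages_from_pdfinfo() {
    awk '/^Pages:/ {print $2}'
}

# Hypothetical helper: prints the number of pages of the PDF given as $1.
page_count() {
    pdfinfo "$1" | pages_from_pdfinfo
}

# Hypothetical helper: succeeds only if both page counts could be read
# and are equal.
same_page_count() {
    local n1 n2
    n1=$(page_count "$1")
    n2=$(page_count "$2")
    [ -n "$n1" ] && [ "$n1" = "$n2" ]
}
```

It could be used as `same_page_count file1.pdf file2.pdf || echo "page counts differ"` before running either script.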
To do exactly what the OP asked for, we execute the following on the command line:
ComparePdfs2.sh file1.pdf file2.pdf dif_in_files.pdf
where dif_in_files.pdf is the name I picked for the output file. The execution takes a bit of time because, for both input PDF files, each individual page must be converted to PNG. The page currently being processed is printed in the terminal. At the end, the script will produce the file dif_in_files.pdf in the working directory. It contains one difference page for every page of the input, and any differences are highlighted in red.
If we are only interested in seeing the pages that are different, or only in finding out whether the files differ at all, then we use ComparePdfs.sh.
On the command line, we execute
ComparePdfs.sh file1.pdf file2.pdf
In the terminal, the script will output the following:
page_001: same
page_002: same
page_003: different
page_004: same
For the pages that turned out different, and only those pages, the script will create files that highlight the differences. In the above example, the script would generate just one file, called difference_page_003.png.
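The page labels and file names above follow from the pdftk burst pattern used in the scripts. Here is a small sketch of how such a label can be derived with bash parameter expansion alone. The function name page_label is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: given the original PDF name and the name of one of its
# burst pages, print the page label, e.g. "page_003".
page_label() {
    local original="$1" burst_page="$2"
    local base="${original%.*}"              # file1.pdf -> file1
    local label="${burst_page#"$base---"}"   # file1---page_003.pdf -> page_003.pdf
    echo "${label%.pdf}"                     # page_003.pdf -> page_003
}

label=$(page_label file1.pdf file1---page_003.pdf)
echo "$label"                   # prints: page_003
echo "file2---$label.pdf"       # prints: file2---page_003.pdf
echo "difference_$label.png"    # prints: difference_page_003.png
```

The `%.*` expansion strips only the last extension, so a name such as my.report.pdf keeps its inner dot.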
How ComparePdfs.sh works
For each of the two PDF files, we use pdftk to burst it into individual pages, and then convert each page to PNG. Now we consider the PNGs of the first pages of the two files. We create a checksum for each (I chose to use b2sum to do that).
If the checksums are the same, we take the first pages of the two files to be the same.
If the checksums are different, we take the first pages of the two files to be different, and use compare to generate a difference PNG file for them.
We repeat this for each page. At the end, we erase all the .pdf and .png files of the individual pages, except for the difference files.
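As a variation on the checksum step, compare itself can report whether two images differ. According to the ImageMagick documentation, it exits with status 0 if the images are similar, 1 if they are dissimilar, and 2 on error. The sketch below relies on that behaviour. The function names are hypothetical, and it should be treated as an untested alternative rather than as what the scripts in this answer do.

```shell
#!/bin/bash
# Hypothetical helper: translate the exit status of compare into a label.
status_to_label() {
    case "$1" in
        0) echo "same" ;;
        1) echo "different" ;;
        *) echo "error" ;;
    esac
}

# Hypothetical helper: compare two PNG files pixel by pixel.
# -metric AE counts the pixels that differ; "null:" discards the difference
# image, so only the exit status is used here.
png_status() {
    compare -metric AE "$1" "$2" null: >/dev/null 2>&1
    status_to_label "$?"
}
```

A call such as `png_status file1---page_003-1.png file2---page_003-1.png` compares the decoded pixels rather than the bytes of the PNG files, and it reports "error" if compare is missing or cannot read a file.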
The scripts
Here is ComparePdfs2.sh.
#!/bin/bash
file_1="$1"
file_2="$2"
outfile="$3"
# here we set the DPI resolution for the pdftoppm command, which will convert PDF to PNG
resolution=150
# bursting the files into individual pages
pdftk "$file_1" burst output "${file_1%.*}---page_%03d.pdf"
pdftk "$file_2" burst output "${file_2%.*}---page_%03d.pdf"
# this array collects the names of the .png files to be converted to a single .pdf file
DiffFiles=()
# we loop over the individual pages of the first file
for f1 in "${file_1%.*}"---page_*.pdf
do
    # 'page' holds the current page label, e.g. 'page_003'
    page="${f1#"${file_1%.*}---"}"
    page="${page%.pdf}"
    # f2 is the name of the PDF of the corresponding page of the second file
    f2="${file_2%.*}---$page.pdf"
    # print the current page being processed
    echo -n "$page "
    # convert the individual page PDFs to PNGs
    pdftoppm "$f1" "${f1%.*}" -png -r "$resolution"
    pdftoppm "$f2" "${f2%.*}" -png -r "$resolution"
    # 'g1' and 'g2' are the names of the two PNG files we just created
    g1="${f1%.*}-1.png"
    g2="${f2%.*}-1.png"
    # create the difference file for this page
    compare "$g1" "$g2" "${outfile%.*}_$page.png"
    # add the name of the latest difference .png file to the DiffFiles array
    DiffFiles+=("${outfile%.*}_$page.png")
done
echo
# convert the .png difference files to a single .pdf file
convert "${DiffFiles[@]}" "$outfile"
# clean up (quoted so that file names containing spaces are handled correctly)
rm -f "${file_1%.*}"---page_* "${file_2%.*}"---page_* "${outfile%.*}"_page_* doc_data.txt
Here is ComparePdfs.sh
#!/bin/bash
file_1="$1"
file_2="$2"
# here we set the DPI resolution for the pdftoppm command, which will convert PDF to PNG
resolution=150
# bursting the files into individual pages
pdftk "$file_1" burst output "${file_1%.*}---page_%03d.pdf"
pdftk "$file_2" burst output "${file_2%.*}---page_%03d.pdf"
# we loop over the individual pages of the first file
for f1 in "${file_1%.*}"---page_*.pdf
do
    # 'page' holds the current page label, e.g. 'page_003'
    page="${f1#"${file_1%.*}---"}"
    page="${page%.pdf}"
    # f2 is the name of the PDF of the corresponding page of the second file
    f2="${file_2%.*}---$page.pdf"
    # convert the individual page PDFs to PNGs
    pdftoppm "$f1" "${f1%.*}" -png -r "$resolution"
    pdftoppm "$f2" "${f2%.*}" -png -r "$resolution"
    # 'g1' and 'g2' are the names of the two PNG files we just created
    g1="${f1%.*}-1.png"
    g2="${f2%.*}-1.png"
    # create the checksums for the two PNG files
    B2S_1=$(b2sum "$g1" | awk '{print $1}')
    B2S_2=$(b2sum "$g2" | awk '{print $1}')
    # now we compare the checksums
    if [ "$B2S_1" = "$B2S_2" ]; then
        echo "$page: same"
    else
        echo "$page: different"
        # if the checksums are different, create a difference PNG image
        compare "$g1" "$g2" "difference_$page.png"
    fi
done
# clean up (quoted so that file names containing spaces are handled correctly)
rm -f "${file_1%.*}"---page_* "${file_2%.*}"---page_* doc_data.txt
Finally, here is AreIdentical.sh:
#!/bin/bash
file_1="$1"
file_2="$2"
# compute the checksums (the file names are quoted so that names with spaces work)
B2S_1=$(b2sum "$file_1" | awk '{print $1}')
B2S_2=$(b2sum "$file_2" | awk '{print $1}')
if [ "$B2S_1" = "$B2S_2" ]; then echo "same"; else echo "different"; fi