I need to compare the contents of two nearly identical PDF files and highlight the differing portions in the corresponding PDF file. I am using PDFBox. Please help me, at least with the logic.
5 Answers
If you prefer a tool with a GUI, you could try this one: diffpdf. It's by Mark Summerfield, and since it's written with Qt, it should be available (or at least buildable) on every platform where Qt runs.

- Is there any chance you could use this on the CLI, skip the GUI and redirect the output directly to a file? – caw Nov 21 '16 at 22:26
- @caw: (1) Did you see [my other answer](http://stackoverflow.com/a/6737451/359307)? -- (2) AFAIK, newer versions of DiffPDF can redirect output to a CSV file. I don't know if this completely skips the GUI, though. -- (3) There is a "purely-CLI" version of DiffPDF available, called *DiffPDFc*, to be found at [www.qtrac.eu](http://www.qtrac.eu/) -- however, it is for Windows only. – Kurt Pfeifle Nov 21 '16 at 22:44
- I haven't, but tried ImageMagick, `pdftk` and Ghostscript before. Not in that combination, but separately. Since the results of `diffpdf` are so good, in fact excellent, I had hoped that all this functionality which is already there could just be used to redirect into a PDF on the CLI. What a pity! Thanks for the information on the other versions of that tool as well. Unfortunately, newer versions are not open-source anymore and Windows-only is not perfect, either. – caw Nov 21 '16 at 22:51
You can do the same thing with a shell script on Linux. The script wraps 3 components:
- ImageMagick's `compare` command
- the `pdftk` utility
- Ghostscript
It's rather easy to translate this into a `.bat` batch file for DOS/Windows...
Here are the building blocks:
pdftk
Use this command to split multipage PDF files into multiple single-page PDFs:
pdftk first.pdf burst output somewhere/firstpdf_page_%03d.pdf
pdftk 2nd.pdf burst output somewhere/2ndpdf_page_%03d.pdf
compare
Use this command to create a "diff" PDF page for each of the pages:
compare \
-verbose \
-debug coder -log "%u %m:%l %e" \
somewhere/firstpdf_page_001.pdf \
somewhere/2ndpdf_page_001.pdf \
-compose src \
somewhereelse/diff_page_001.pdf
Note that `compare` is part of ImageMagick. For PDF processing, however, it needs Ghostscript as a 'delegate', because it cannot handle PDF natively.
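For readers who want the same idea in Java (the question mentions PDFBox), here is a minimal sketch of the per-page logic. It assumes both pages have already been rendered to `BufferedImage` objects by some PDF renderer; that step is not shown. The class name `PageDiff` and its methods are hypothetical, and the sketch only approximates what `compare -compose src` produces: white where the pages agree, red where they differ.

```java
import java.awt.image.BufferedImage;

public class PageDiff {
    private static final int WHITE = 0xFFFFFF;
    private static final int RED = 0xFF0000;

    /**
     * Builds a diff image: white where both pages have the same RGB value,
     * red everywhere else. Pixels not covered by both pages (different page
     * sizes) are treated as differences.
     */
    public static BufferedImage diff(BufferedImage a, BufferedImage b) {
        int width = Math.max(a.getWidth(), b.getWidth());
        int height = Math.max(a.getHeight(), b.getHeight());
        BufferedImage out = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                boolean inA = x < a.getWidth() && y < a.getHeight();
                boolean inB = x < b.getWidth() && y < b.getHeight();
                // Mask off the alpha channel so only the colour is compared.
                boolean same = inA && inB
                        && (a.getRGB(x, y) & 0xFFFFFF) == (b.getRGB(x, y) & 0xFFFFFF);
                out.setRGB(x, y, same ? WHITE : RED);
            }
        }
        return out;
    }

    /** Counts the non-white (that is, differing) pixels in a diff image. */
    public static int countDifferences(BufferedImage diffImage) {
        int count = 0;
        for (int y = 0; y < diffImage.getHeight(); y++) {
            for (int x = 0; x < diffImage.getWidth(); x++) {
                if ((diffImage.getRGB(x, y) & 0xFFFFFF) != WHITE) {
                    count++;
                }
            }
        }
        return count;
    }
}
```

A page pair with `countDifferences(diff(a, b)) == 0` has no visible differences at the chosen rendering resolution. An exact pixel match is strict; anti-aliasing differences between renderers may call for a tolerance instead.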
Once more, pdftk
Now you can concatenate your "diff" PDF pages again with `pdftk`:
pdftk \
somewhereelse/diff_page_*.pdf \
cat \
output somewhereelse/diff_allpages.pdf
Ghostscript
Ghostscript automatically inserts metadata (such as the current date and time) into its PDF output. Because of this, two otherwise identical PDFs get different checksums, so MD5-hash-based file comparisons do not work well on them.
If you want to automatically discover all cases which consist of purely white pages (that is, there are no visible differences in your input pages), you could also convert to a metadata-free bitmap format using the `bmp256` output device. You can do that for the original PDFs (first.pdf and 2nd.pdf), or for the diff-PDF pages:
gs \
-o diff_page_001.bmp \
-r72 \
-g595x842 \
-sDEVICE=bmp256 \
diff_page_001.pdf
md5sum diff_page_001.bmp
For reference, create an all-white BMP page and compute its MD5 sum like this; any diff page with the same hash has no visible differences:
gs \
-o reference-white-page.bmp \
-r72 \
-g595x842 \
-sDEVICE=bmp256 \
-c "showpage quit"
md5sum reference-white-page.bmp
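The same "hash it and compare against a white reference" check can be sketched in Java on in-memory images. This is only an illustration under assumptions: the class `PageHash` and its methods are hypothetical, and the hash covers raw pixel values rather than a BMP file, so the digests will not match the output of `md5sum` above.

```java
import java.awt.image.BufferedImage;
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class PageHash {
    /** Returns the MD5 digest (lowercase hex) of the image's ARGB pixel values, row by row. */
    public static String md5OfPixels(BufferedImage img) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            ByteBuffer row = ByteBuffer.allocate(img.getWidth() * 4);
            for (int y = 0; y < img.getHeight(); y++) {
                row.clear();
                for (int x = 0; x < img.getWidth(); x++) {
                    row.putInt(img.getRGB(x, y));
                }
                md.update(row.array());
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    /** Creates an all-white reference page of the given size. */
    public static BufferedImage whitePage(int width, int height) {
        BufferedImage img = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                img.setRGB(x, y, 0xFFFFFF);
            }
        }
        return img;
    }

    /** True if the page hashes to the same value as a white page of equal size. */
    public static boolean isBlank(BufferedImage page) {
        BufferedImage reference = whitePage(page.getWidth(), page.getHeight());
        return md5OfPixels(page).equals(md5OfPixels(reference));
    }
}
```

In a real pipeline the reference hash would be computed once per page size and cached, just as the shell version computes `reference-white-page.bmp` once.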

- Here's a script to visually diff two PDFs page-by-page using ImageMagick and Poppler tools (for speed): https://gist.github.com/brechtm/891de9f72516c1b2cbc1. It outputs one JPG for each page of the PDFs in a `pdfdiff` directory and additionally prints the numbers of the pages which differ between the two PDFs. – Brecht Machiels Mar 31 '16 at 13:38
I had this very problem myself and the quickest way that I've found is to use PHP and its bindings for ImageMagick (Imagick).
<?php
$im1 = new \Imagick("file1.pdf");
$im2 = new \Imagick("file2.pdf");
$result = $im1->compareImages($im2, \Imagick::METRIC_MEANSQUAREERROR);
if($result[1] > 0.0){
// Files are DIFFERENT
}
else{
// Files are IDENTICAL
}
$im1->destroy();
$im2->destroy();
Of course, you need to install the ImageMagick bindings first:
sudo apt-get install php5-imagick # Ubuntu/Debian
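To show what a mean-squared-error metric measures, here is a rough Java sketch on two already-rendered page images. The class `ImageMse` is hypothetical, and the normalisation (channel values scaled to 0..1, averaged over the three colour channels) is an assumption chosen for readability; ImageMagick's exact numbers may differ.

```java
import java.awt.image.BufferedImage;

public class ImageMse {
    /**
     * Mean squared error between two equally sized images.
     * Each colour channel is scaled to 0..1, so the result lies in 0..1,
     * and 0.0 means the images are pixel-identical.
     */
    public static double meanSquaredError(BufferedImage a, BufferedImage b) {
        if (a.getWidth() != b.getWidth() || a.getHeight() != b.getHeight()) {
            throw new IllegalArgumentException("Images must have the same dimensions");
        }
        double sum = 0.0;
        for (int y = 0; y < a.getHeight(); y++) {
            for (int x = 0; x < a.getWidth(); x++) {
                int p = a.getRGB(x, y);
                int q = b.getRGB(x, y);
                // Visit the red, green and blue channels in turn.
                for (int shift = 16; shift >= 0; shift -= 8) {
                    double d = (((p >> shift) & 0xFF) - ((q >> shift) & 0xFF)) / 255.0;
                    sum += d * d;
                }
            }
        }
        return sum / (3.0 * a.getWidth() * a.getHeight());
    }
}
```

As in the PHP snippet, a result greater than `0.0` means the two images differ; a small threshold instead of an exact zero can absorb rendering noise.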

I have put together a jar, built on Apache PDFBox, for comparing PDF files. It can compare them pixel by pixel and highlight the differences.
Check my blog for examples and the download: http://www.testautomationguru.com/introducing-pdfutil-to-compare-pdf-files-extract-resources/
To get page count
import com.taguru.utility.PDFUtil;
PDFUtil pdfUtil = new PDFUtil();
pdfUtil.getPageCount("c:/sample.pdf"); //returns the page count
To get page content as plain text
//returns the pdf content - all pages
pdfUtil.getText("c:/sample.pdf");
// returns the pdf content from page number 2
pdfUtil.getText("c:/sample.pdf",2);
// returns the pdf content from page number 5 to 8
pdfUtil.getText("c:/sample.pdf", 5, 8);
To extract attached images from PDF
//set the path where we need to store the images
pdfUtil.setImageDestinationPath("c:/imgpath");
pdfUtil.extractImages("c:/sample.pdf");
// extracts & saves the images from page number 3
pdfUtil.extractImages("c:/sample.pdf", 3);
// extracts & saves the images from page 2 only
pdfUtil.extractImages("c:/sample.pdf", 2, 2);
To store PDF pages as images
//set the path where we need to store the images
pdfUtil.setImageDestinationPath("c:/imgpath");
pdfUtil.savePdfAsImage("c:/sample.pdf");
To compare PDF files in text mode (faster, but it does not compare formatting, images, etc. in the PDF)
String file1="c:/files/doc1.pdf";
String file2="c:/files/doc2.pdf";
// compares the pdf documents & returns a boolean
// true if both files have same content. false otherwise.
pdfUtil.comparePdfFilesTextMode(file1, file2);
// compare the 3rd page alone
pdfUtil.comparePdfFilesTextMode(file1, file2, 3, 3);
// compare the pages from 1 to 5
pdfUtil.comparePdfFilesTextMode(file1, file2, 1, 5);
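If you want to know where the text differs rather than just getting a boolean, a line-by-line comparison of the two extracted texts is one possible approach. The sketch below is not part of the jar above; the class `TextDiff` is hypothetical, and it assumes the text of each document has already been extracted into a `String`.

```java
import java.util.ArrayList;
import java.util.List;

public class TextDiff {
    /**
     * Returns the 1-based numbers of the lines that differ between two texts.
     * Lines are compared position by position after trimming whitespace;
     * a line that exists in only one text counts as different.
     */
    public static List<Integer> differingLines(String first, String second) {
        String[] a = first.split("\\R", -1);
        String[] b = second.split("\\R", -1);
        List<Integer> result = new ArrayList<>();
        int max = Math.max(a.length, b.length);
        for (int i = 0; i < max; i++) {
            String left = i < a.length ? a[i].trim() : null;
            String right = i < b.length ? b[i].trim() : null;
            if (left == null || !left.equals(right)) {
                result.add(i + 1);
            }
        }
        return result;
    }
}
```

This is a positional comparison, not a true diff: a single inserted line shifts everything after it, so all following lines are reported as different. For documents that are "almost similar" with in-place edits it is usually enough; otherwise a longest-common-subsequence diff would be the next step.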
To compare PDF files in binary mode (slower: compares the PDF documents pixel by pixel, highlights the differences and stores the result as an image)
String file1="c:/files/doc1.pdf";
String file2="c:/files/doc2.pdf";
// compares the pdf documents & returns a boolean
// true if both files have same content. false otherwise.
pdfUtil.comparePdfFilesBinaryMode(file1, file2);
// compare the 3rd page alone
pdfUtil.comparePdfFilesBinaryMode(file1, file2, 3, 3);
// compare the pages from 1 to 5
pdfUtil.comparePdfFilesBinaryMode(file1, file2, 1, 5);
//if you need to store the result
pdfUtil.highlightPdfDifference(true);
pdfUtil.setImageDestinationPath("c:/imgpath");
pdfUtil.comparePdfFilesBinaryMode(file1, file2);

- I got an error when tried to download that file "The transferred file contained a virus and was therefore blocked. URL: http://www.testautomationguru.com/download/304/ Media Type: application/java-vm Virus Name: McAfeeGW: BehavesLike.Java.Suspicious.xm" – scc Aug 20 '15 at 22:07
- I try to run the jar file from above mentioned site,but I am getting error like "no main manifest attribute, in taguru-pdf-util.jar", could you please help me on this – Nachiappan R Sep 12 '16 at 12:51
To compare PDFs on macOS Monterey (i.e. version 12), I was able to install diff-pdf using Homebrew and run it.
The `--view` option didn't work for me, but the `--output-diff` option did.
