10

I need to compare the contents of two nearly identical files and highlight the differing portions in the corresponding PDF file. I am using PDFBox. Please help me, at least with the logic.

Aisharjya Sarkar

5 Answers

7

If you prefer a tool with a GUI, you could try this one: diffpdf. It's by Mark Summerfield, and since it's written with Qt, it should be available (or at least buildable) on every platform that Qt runs on.


Kurt Pfeifle
  • Is there any chance you could use this on the CLI, skip the GUI and redirect the output directly to a file? – caw Nov 21 '16 at 22:26
  • @caw: (1) Did you see [my other answer](http://stackoverflow.com/a/6737451/359307)? -- (2) AFAIK, newer versions of DiffPDF can redirect output to a CSV file. I don't know if this completely skips the GUI, though. -- (3) There is a "purely-CLI" version of DiffPDF available, called *DiffPDFc*, to be found at [www.qtrac.eu](http://www.qtrac.eu/) -- however, it is for Windows only. – Kurt Pfeifle Nov 21 '16 at 22:44
  • I haven't, but tried ImageMagick, `pdftk` and Ghostscript before. Not in that combination, but separately. Since the results of `diffpdf` are so good, in fact excellent, I had hoped that all this functionality which is already there could just be used to redirect into a PDF on the CLI. What a pity! Thanks for the information on the other versions of that tool as well. Unfortunately, newer versions are not open-source anymore and Windows-only is not perfect, either. – caw Nov 21 '16 at 22:51
5

You can do the same thing with a shell script on Linux. The script wraps 3 components:

  1. ImageMagick's compare command
  2. the pdftk utility
  3. Ghostscript

It's rather easy to translate this into a .bat Batch file for DOS/Windows...

Here are the building blocks:

pdftk

Use these commands to split multi-page PDF files into multiple single-page PDFs:

pdftk  first.pdf  burst  output  somewhere/firstpdf_page_%03d.pdf
pdftk  2nd.pdf    burst  output  somewhere/2ndpdf_page_%03d.pdf

compare

Use this command to create a "diff" PDF page for each of the pages:

compare \
       -verbose \
       -debug coder -log "%u %m:%l %e" \
        somewhere/firstpdf_page_001.pdf \
        somewhere/2ndpdf_page_001.pdf \
       -compose src \
        somewhereelse/diff_page_001.pdf

Note that compare is part of ImageMagick. It cannot process PDF natively, though, so it needs Ghostscript as a 'delegate' for PDF input and output.

Once more, pdftk

Now you can again concatenate your "diff" PDF pages with pdftk:

pdftk \
      somewhereelse/diff_page_*.pdf \
      cat \
      output somewhereelse/diff_allpages.pdf

Ghostscript

Ghostscript automatically inserts metadata (such as the current date and time) into its PDF output. As a result, PDF files it writes are not suitable for MD5-hash-based file comparisons.

If you want to automatically detect all diff pages that are purely white (meaning there are no visible differences between the input pages), you can convert to a metadata-free bitmap format using the bmp256 output device. You can do that for the original PDFs (first.pdf and 2nd.pdf), or for the diff PDF pages:

 gs \
   -o diff_page_001.bmp \
   -r72 \
   -g595x842 \
   -sDEVICE=bmp256 \
    diff_page_001.pdf

 md5sum diff_page_001.bmp

Then create an all-white BMP page with the same resolution and page size, and compute its MD5 sum as a reference, like this:

 gs \
   -o reference-white-page.bmp \
   -r72 \
   -g595x842 \
   -sDEVICE=bmp256 \
   -c "showpage quit"

 md5sum reference-white-page.bmp
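
Since the question is about PDFBox and Java, the same blank-page check can be done without md5sum. This is only a minimal sketch, assuming the BMP files were already produced by the gs commands above; the class `BlankPageCheck` and its method names are hypothetical and use nothing beyond the JDK:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of any library mentioned in this thread.
final class BlankPageCheck {

    /** Returns the MD5 digest of the given bytes as a lowercase hex string. */
    static String md5Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(data);
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required to be present in every Java runtime.
            throw new IllegalStateException(e);
        }
    }

    /** Returns the names of all pages whose digest differs from the all-white reference. */
    static List<String> pagesWithDifferences(byte[] whiteReference, Map<String, byte[]> pages) {
        String whiteDigest = md5Hex(whiteReference);
        List<String> changed = new ArrayList<>();
        for (Map.Entry<String, byte[]> page : pages.entrySet()) {
            if (!whiteDigest.equals(md5Hex(page.getValue()))) {
                changed.add(page.getKey());
            }
        }
        return changed;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: BlankPageCheck reference-white-page.bmp diff_page_001.bmp ...");
            return;
        }
        byte[] reference = Files.readAllBytes(Paths.get(args[0]));
        Map<String, byte[]> pages = new LinkedHashMap<>();
        for (int i = 1; i < args.length; i++) {
            pages.put(args[i], Files.readAllBytes(Paths.get(args[i])));
        }
        System.out.println("Pages with visible differences: " + pagesWithDifferences(reference, pages));
    }
}
```

The check only works if the reference page and the diff pages were rendered with identical -r and -g settings, as in the commands above.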
Kurt Pfeifle
  • Here's a script to visually diff two PDFs page-by-page using ImageMagick and Poppler tools (for speed): https://gist.github.com/brechtm/891de9f72516c1b2cbc1. It outputs one JPG for each page of the PDFs in a `pdfdiff` directory and additionally prints the numbers of the pages which differ between the two PDFs. – Brecht Machiels Mar 31 '16 at 13:38
4

I had this very problem myself and the quickest way that I've found is to use PHP and its bindings for ImageMagick (Imagick).

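<?php
// Load both PDFs (ImageMagick needs Ghostscript to read PDF input).
$im1 = new \Imagick("file1.pdf");
$im2 = new \Imagick("file2.pdf");

// compareImages() returns an array: [0] is the difference image, [1] is the metric value.
$result = $im1->compareImages($im2, \Imagick::METRIC_MEANSQUAREERROR);

if ($result[1] > 0.0) {
    // Files are DIFFERENT
} else {
    // Files are IDENTICAL
}

// Release the resources held by both objects.
$im1->destroy();
$im2->destroy();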

Of course, you need to install the ImageMagick bindings first:

sudo apt-get install php5-imagick # Ubuntu/Debian (the package is named php-imagick on newer releases)
paul.ago
0

I have built a jar on top of Apache PDFBox to compare PDF files. It can compare them pixel by pixel and highlight the differences.

See my blog for examples and the download: http://www.testautomationguru.com/introducing-pdfutil-to-compare-pdf-files-extract-resources/


To get page count

import com.taguru.utility.PDFUtil;

PDFUtil pdfUtil = new PDFUtil();
pdfUtil.getPageCount("c:/sample.pdf"); //returns the page count

To get page content as plain text

//returns the pdf content - all pages
pdfUtil.getText("c:/sample.pdf");

// returns the pdf content from page number 2
pdfUtil.getText("c:/sample.pdf",2);

// returns the pdf content from page number 5 to 8
pdfUtil.getText("c:/sample.pdf", 5, 8);

To extract attached images from PDF

//set the path where we need to store the images
 pdfUtil.setImageDestinationPath("c:/imgpath");
 pdfUtil.extractImages("c:/sample.pdf");

// extracts & saves the images from page number 3
pdfUtil.extractImages("c:/sample.pdf", 3);

// extracts & saves the images from page 2 only
pdfUtil.extractImages("c:/sample.pdf", 2, 2);

To store PDF pages as images

//set the path where we need to store the images
 pdfUtil.setImageDestinationPath("c:/imgpath");
 pdfUtil.savePdfAsImage("c:/sample.pdf");

To compare PDF files in text mode (faster, but it does not compare the formatting, images, etc. in the PDF)

String file1="c:/files/doc1.pdf";
String file2="c:/files/doc2.pdf";

// compares the pdf documents & returns a boolean
// true if both files have same content. false otherwise.
pdfUtil.comparePdfFilesTextMode(file1, file2);

// compare the 3rd page alone
pdfUtil.comparePdfFilesTextMode(file1, file2, 3, 3);

// compare the pages from 1 to 5
pdfUtil.comparePdfFilesTextMode(file1, file2, 1, 5);
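
For readers who want the logic rather than a boolean result, here is a rough sketch of the comparison step alone. It assumes the text of both documents has already been extracted into Strings (for example with the getText calls shown earlier). The class `TextLineDiff` is hypothetical and is not how PDFUtil is implemented; it compares lines by position, so it is not a true diff algorithm:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Hypothetical helper: positional line-by-line comparison of two extracted texts.
final class TextLineDiff {

    /** Returns the 1-based numbers of all lines that differ between the two texts. */
    static List<Integer> differingLines(String textA, String textB) {
        // "\\R" matches any line break (\n, \r\n, ...); -1 keeps trailing empty lines.
        String[] a = textA.split("\\R", -1);
        String[] b = textB.split("\\R", -1);
        List<Integer> differing = new ArrayList<>();
        int lineCount = Math.max(a.length, b.length);
        for (int i = 0; i < lineCount; i++) {
            // A line that is missing in one document counts as a difference.
            String lineA = i < a.length ? a[i].trim() : null;
            String lineB = i < b.length ? b[i].trim() : null;
            if (!Objects.equals(lineA, lineB)) {
                differing.add(i + 1);
            }
        }
        return differing;
    }
}
```

Because the comparison is positional, a single inserted line makes every following line count as different. For documents that gain or lose lines, a longest-common-subsequence diff would be the better fit.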

To compare PDF files in binary mode (slower; compares PDF documents pixel by pixel, highlights the differences and stores the result as an image)

String file1="c:/files/doc1.pdf";
String file2="c:/files/doc2.pdf";

// compares the pdf documents & returns a boolean
// true if both files have same content. false otherwise.
pdfUtil.comparePdfFilesBinaryMode(file1, file2);

// compare the 3rd page alone
pdfUtil.comparePdfFilesBinaryMode(file1, file2, 3, 3);

// compare the pages from 1 to 5
pdfUtil.comparePdfFilesBinaryMode(file1, file2, 1, 5);

//if you need to store the result
pdfUtil.highlightPdfDifference(true);
pdfUtil.setImageDestinationPath("c:/imgpath");
pdfUtil.comparePdfFilesBinaryMode(file1, file2);
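
As for the underlying logic, a pixel-by-pixel comparison with highlighting can be sketched with plain JDK classes. The sketch assumes each PDF page has already been rendered to a BufferedImage (PDFBox can render pages, but that step is not shown here). `PageImageDiff` is a hypothetical class and not the actual PDFUtil implementation:

```java
import java.awt.Color;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

// Hypothetical helper: marks every pixel that differs between two rendered pages.
final class PageImageDiff {

    /** The highlighted image plus the number of pixels that differ. */
    static final class Result {
        final BufferedImage highlighted;
        final long differingPixels;

        Result(BufferedImage highlighted, long differingPixels) {
            this.highlighted = highlighted;
            this.differingPixels = differingPixels;
        }
    }

    static Result diff(BufferedImage a, BufferedImage b, Color highlight) {
        // Use the larger extent so that a size mismatch shows up as a difference too.
        int width = Math.max(a.getWidth(), b.getWidth());
        int height = Math.max(a.getHeight(), b.getHeight());
        BufferedImage out = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        long count = 0;
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                boolean inA = x < a.getWidth() && y < a.getHeight();
                boolean inB = x < b.getWidth() && y < b.getHeight();
                if (inA && inB && a.getRGB(x, y) == b.getRGB(x, y)) {
                    out.setRGB(x, y, a.getRGB(x, y));      // identical: copy the pixel
                } else {
                    out.setRGB(x, y, highlight.getRGB());  // different: paint the highlight
                    count++;
                }
            }
        }
        return new Result(out, count);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 3) {
            System.err.println("usage: PageImageDiff pageA.png pageB.png diff.png");
            return;
        }
        BufferedImage a = ImageIO.read(new File(args[0]));
        BufferedImage b = ImageIO.read(new File(args[1]));
        Result result = diff(a, b, Color.RED);
        ImageIO.write(result.highlighted, "png", new File(args[2]));
        System.out.println(result.differingPixels + " differing pixels");
    }
}
```

An exact pixel match is strict: anti-aliasing or a different rendering resolution will flag pixels that look identical to the eye, so both documents should be rendered with the same settings.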
vins
  • I got an error when tried to download that file "The transferred file contained a virus and was therefore blocked. URL: http://www.testautomationguru.com/download/304/ Media Type: application/java-vm Virus Name: McAfeeGW: BehavesLike.Java.Suspicious.xm" – scc Aug 20 '15 at 22:07
  • I try to run the jar file from above mentioned site,but I am getting error like "no main manifest attribute, in taguru-pdf-util.jar", could you please help me on this – Nachiappan R Sep 12 '16 at 12:51
0

To compare PDFs on macOS Monterey (i.e. version 12), I was able to install diff-pdf using Homebrew and run it.

The --view option didn't work for me, but the --output-diff did.

Greg Sadetsky