I need to compare two PDF files and find all the visual differences between them. I know there are related questions on Stack Overflow, but they don't cover what I need.

I'm currently using PDFBox to render each PDF page to an image and comparing the bytes of the images.

With this approach I can tell that a particular page differs.

But I need finer details, such as a difference in the font size of some text: for example, that "The text" differs on, say, page 6 of the PDFs.

And not only for text: I need to catch all visual differences, such as images, text in charts, etc.

Please suggest a way to achieve this.

PS: I tried Apache Tika, but my impression is that it is meant for extracting structured text as XHTML, plus metadata. Fine details such as font size and font weight do not appear in the structured text. Please correct me if I've got this wrong.

unknown_boundaries
  • _"...not only for text but I need to take care of all visual differences such as images, text in charts etc..."_. Even an OCR isn't enough for you. Are you **SURE** this is doable? Really SURE??? – Adriano Repetti Jan 23 '14 at 15:06
  • @Adriano Not sure. Okay, let's put it this way: comparing the bytes of the page images doesn't tell me where the difference is. I need to know more than that the page differs; I need to know what actually differs. I don't know how much detail about the differences we can get. Does that make sense? – unknown_boundaries Jan 23 '14 at 17:01
  • Yes, if you have PDF or image you can at least give page number but to say what is different (text or visual characteristics) well...this is IMO really too complex to be doable – Adriano Repetti Jan 23 '14 at 17:24
  • It's doable by transforming PDF into image, transforming the image into array of pixels; then, do the same to another PDF and when iterating through the array of the first image, compare the pixel (its color, to be more precise) in that position with the pixel in that same position in the second array. – Rasshu Jan 23 '14 at 17:31
  • There is actually software that does this - I've previously worked with at least one commercial software provider that had software to automatically compare big batches of PDFs generated by invoicing software and they were able to pinpoint font changes, color changes etc... in quite precise detail. However, this is quite complex to write and it would probably be a good start to list exactly what changes you expect and even how these changes could be caused. That might help determine whether you can use the page description structure to look for changes for example. – David van Driessche Jan 23 '14 at 17:40
  • @DavidvanDriessche Can you tell me the name of that commercial software? I am looking for something similar to compare PDF drawings and merge some of the highlights/notes into the new document. – scc Aug 20 '15 at 21:50

2 Answers

PDF to image using Java

Convert PDF to thumbnail image in Java (there's an example of pdf-renderer use here)

https://www.google.com.br/search?q=PixelGraber&ie=utf-8&oe=utf-8&rls=org.mozilla:pt-BR:official&client=firefox-a&gws_rd=cr&ei=K1PhUqD2Jei0sQTQs4DoAw

A good library for converting PDF to TIFF?

Convert jpeg/png to an array of pixels in java

int pixels array to bmp in java

Finding pixel position

Get Pixel Color around an image

For extraction of text using PDFBox: Extracting text from PDF file using pdfbox

There are classes in PDFBox for detecting font position, type, size and maybe (I didn't search deeper) other settings (links below). You could then extract the text from both PDFs and compare it; where the text is equal, compare the formatting. Where something differs, mark it for display in a separate text file, image or PDF.

http://pdfbox.apache.org/docs/1.8.3/javadocs/org/apache/pdfbox/util/TextPosition.html

http://pdfbox.apache.org/docs/1.8.2/javadocs/org/apache/pdfbox/pdmodel/graphics/PDFontSetting.html
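
The "compare the text, then compare the format" idea can be sketched roughly as below. This is only an illustration under assumptions: `TextRun` is a hypothetical holder class (it is not part of PDFBox), and filling it from PDFBox, for example from the `TextPosition` objects linked above, is left out. The comparison also assumes both pages yield their runs in the same order.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: report text and formatting differences between two pages. */
final class TextRunDiff {

    /** Hypothetical holder for one run of text and its formatting (not a PDFBox class). */
    static final class TextRun {
        final String text;
        final String fontName;
        final float fontSize;

        TextRun(String text, String fontName, float fontSize) {
            this.text = text;
            this.fontName = fontName;
            this.fontSize = fontSize;
        }
    }

    /** Compares runs position by position and returns human-readable differences. */
    static List<String> compare(List<TextRun> first, List<TextRun> second) {
        List<String> report = new ArrayList<>();
        if (first.size() != second.size()) {
            report.add("run count differs: " + first.size() + " vs " + second.size());
        }
        int common = Math.min(first.size(), second.size());
        for (int i = 0; i < common; i++) {
            TextRun a = first.get(i);
            TextRun b = second.get(i);
            if (!a.text.equals(b.text)) {
                report.add("run " + i + ": text differs: \"" + a.text + "\" vs \"" + b.text + "\"");
                continue; // formatting is only compared when the text matches
            }
            if (!a.fontName.equals(b.fontName)) {
                report.add("run " + i + " (\"" + a.text + "\"): font differs: "
                        + a.fontName + " vs " + b.fontName);
            }
            // Small tolerance, since font sizes are floats
            if (Math.abs(a.fontSize - b.fontSize) > 0.01f) {
                report.add("run " + i + " (\"" + a.text + "\"): font size differs: "
                        + a.fontSize + " vs " + b.fontSize);
            }
        }
        return report;
    }
}
```

Running `compare` once per page would give a per-page report, which is closer to the "what differs on page 6" detail the question asks for. It does not cover images or charts.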

Rasshu
  • I experimented with the PDFTextStripper class to extract text and character-level formatting information. The problem is that comparing character-level formatting (there are a lot of attributes) across two PDFs is very complex. I also couldn't find any way to get the visual differences in images within the PDFs. – unknown_boundaries Jan 23 '14 at 17:56
  • Maybe it is less complex using the techniques I described, that is, doing exactly what you want "by hand". That could also give you more control over the functionality. Or, if it's really more complex, define a specific format you want as "the correct format" (I suppose each one has a unique ID or name) and compare their IDs or names. The PDFont class has a method getBaseFont() which returns the PostScript name of the font (String). PDFontSetting has a method getFontSize() which returns the size of the font (float). – Rasshu Jan 23 '14 at 18:19

Check out this Java package: https://java.net/projects/pdf-renderer

You can convert each PDF page to an image, then traverse the image as a 2D array of pixels and compare it, pixel by pixel, with the same page of the other PDF.
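
A minimal sketch of that pixel-by-pixel traversal, using only the standard `java.awt` classes. It assumes both pages have already been rendered to `BufferedImage`s of the same size (the rendering step itself is not shown), and the class and method names are made up for illustration. Instead of only saying "the page differs", it returns the bounding box of the differing pixels.

```java
import java.awt.Rectangle;
import java.awt.image.BufferedImage;

/** Sketch: locate the region where two rendered pages differ. */
final class PageImageDiff {

    /**
     * Returns the smallest rectangle containing every differing pixel,
     * or null when the two images are identical.
     */
    static Rectangle diffBounds(BufferedImage a, BufferedImage b) {
        if (a.getWidth() != b.getWidth() || a.getHeight() != b.getHeight()) {
            throw new IllegalArgumentException("Pages must be rendered at the same size");
        }
        int minX = Integer.MAX_VALUE;
        int minY = Integer.MAX_VALUE;
        int maxX = -1;
        int maxY = -1;
        for (int y = 0; y < a.getHeight(); y++) {
            for (int x = 0; x < a.getWidth(); x++) {
                // getRGB returns the packed ARGB colour of the pixel
                if (a.getRGB(x, y) != b.getRGB(x, y)) {
                    minX = Math.min(minX, x);
                    minY = Math.min(minY, y);
                    maxX = Math.max(maxX, x);
                    maxY = Math.max(maxY, y);
                }
            }
        }
        if (maxX < 0) {
            return null; // no differing pixel found
        }
        return new Rectangle(minX, minY, maxX - minX + 1, maxY - minY + 1);
    }
}
```

An exact colour comparison like this is sensitive to anti-aliasing and rendering noise, so a real comparison would probably want a tolerance. It also reports where something changed, not what kind of change it was (font size, image, chart text).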

mjkaufer
  • And this package does all that stuff or only the conversion? – Rasshu Jan 23 '14 at 15:14
  • @mjkaufer Have you really read and understood the question? – mkl Jan 23 '14 at 15:16
  • Yes. He is trying to compare two PDFs. You can do all this image based. – mjkaufer Jan 23 '14 at 15:18
  • The logic is transformation of two PDFs into an array of pixels for each file, then iteration through the pixels and, finally, comparison of their color with the pixel of same position in the other arrays. Right? Then it could display the different pixels between both PDFs with 100% opacity in the same file (and different colors for each PDF) and the equivalent pixels with less opacity (something like 50%). – Rasshu Jan 23 '14 at 17:27
  • @RodrigoSiejaBertin You've got it. – mjkaufer Jan 23 '14 at 17:57
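
The overlay described in the comments above could look roughly like the following sketch. The colour choices are assumptions, since the comments do not specify them: matching pixels keep their colour at about 50% opacity, and a differing pixel is painted opaque red when the first page is darker at that position (ink only in the first PDF) and opaque blue otherwise. Class and method names are hypothetical.

```java
import java.awt.image.BufferedImage;

/** Sketch: render the differences between two page images as an overlay. */
final class DiffOverlay {

    static final int FIRST_ONLY = 0xFFFF0000;  // opaque red (assumed colour)
    static final int SECOND_ONLY = 0xFF0000FF; // opaque blue (assumed colour)

    static BufferedImage overlay(BufferedImage first, BufferedImage second) {
        int w = first.getWidth();
        int h = first.getHeight();
        if (w != second.getWidth() || h != second.getHeight()) {
            throw new IllegalArgumentException("Pages must be rendered at the same size");
        }
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_INT_ARGB);
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int p = first.getRGB(x, y);
                int q = second.getRGB(x, y);
                if (p == q) {
                    // Matching pixel: keep the colour, set alpha to 0x80 (about 50%)
                    out.setRGB(x, y, (p & 0x00FFFFFF) | 0x80000000);
                } else {
                    // Differing pixel: colour it by which page is darker here
                    out.setRGB(x, y, brightness(p) < brightness(q) ? FIRST_ONLY : SECOND_ONLY);
                }
            }
        }
        return out;
    }

    /** Sum of the red, green and blue channels of a packed ARGB value. */
    private static int brightness(int argb) {
        return ((argb >> 16) & 0xFF) + ((argb >> 8) & 0xFF) + (argb & 0xFF);
    }
}
```

The result can be written out with `javax.imageio.ImageIO.write(out, "png", file)`, giving one diff image per page.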