I need to measure the percentage of similarity between two files. I am currently using the Maven dependency "string-similarity" to compute the percentage of similarity between two strings.

    <dependency>
        <groupId>net.ricecode</groupId>
        <artifactId>string-similarity</artifactId>
        <version>1.0.0</version>
    </dependency>

    <dependency>
        <groupId>info.debatty</groupId>
        <artifactId>java-string-similarity</artifactId>
        <version>RELEASE</version>
    </dependency>

Are there any dependencies for comparing the contents of two doc/PDF files?
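To make the goal concrete, here is a rough sketch in plain Java of the kind of percentage score I mean. It assumes the text has already been extracted from both files into strings. It uses a hand-written normalized Levenshtein distance rather than the API of either dependency above, and the names `TextSimilarity`, `levenshtein` and `similarityPercent` are made up for this illustration.

```java
public class TextSimilarity {

    // Levenshtein edit distance, computed with two rolling rows of the DP table.
    public static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1), // insertion, deletion
                        prev[j - 1] + cost);                    // match or substitution
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Similarity as a percentage: 100 means equal strings, 0 means nothing in common.
    public static double similarityPercent(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        if (longest == 0) {
            return 100.0; // two empty strings are treated as identical
        }
        return 100.0 * (longest - levenshtein(a, b)) / longest;
    }

    public static void main(String[] args) {
        // The two strings stand in for text extracted from the two files.
        System.out.println(similarityPercent("kitten", "sitting"));
    }
}
```

This runs in time proportional to the product of the two lengths, so for large documents a word-based or line-based comparison would probably be more practical.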

dijo francis
  • 2
    I guess you would have to extract the text from the documents (which can be tricky) and then use something like this: https://stackoverflow.com/questions/2898612/are-there-java-libraries-to-do-a-word-based-diff?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa If you have the option to work with txt files, it will be much easier to extract the text accurately. – Iakovos Apr 23 '18 at 09:38
  • 1
    Hi jack, but what if the PDF file has images? How should I compare them? – dijo francis Apr 23 '18 at 09:42
  • 1
    This is not an easy task. Apart from extracting the images from the PDF (which I have not tried, but I think it will be hard for at least some PDFs), comparing them is even harder because a) you need to know which image in one PDF should be compared against which image in the other, and b) there are plenty of comparison algorithms, some of which work well for certain images and poorly for others. Unfortunately, it is a huge topic that cannot be answered in a single question. If, however, you only care whether two PDFs are identical, that is easy using hash functions (e.g. SHA). – Iakovos Apr 23 '18 at 14:21
  • 1
    Thank you jack. Is it possible to compare both files using a byte stream? – dijo francis Apr 23 '18 at 14:27
  • 1
    Something similar to this? https://superuser.com/questions/125376/how-do-i-compare-binary-files-in-linux/?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa Yes, it is, but I would still try to avoid using PDFs. If you cannot, please read this: https://stackoverflow.com/questions/21816049/read-a-binary-file-in-android?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa and then use a tool such as wdiff to compare. If you cannot find wdiff for Android, convert the binary to 0s and 1s, write each bit on a new line, and compare using diff. – Iakovos Apr 23 '18 at 15:33
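
The two byte-level ideas from the comments above (a SHA hash to test whether two files are identical, and a comparison over the raw bytes) could be sketched roughly as follows. This is a minimal illustration using only the JDK; the class `FileCompare` and its methods are hypothetical names, and the position-by-position byte score is an assumption about what "compare using a byte stream" might mean, not an established similarity measure.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class FileCompare {

    // SHA-256 digest of a byte array.
    public static byte[] sha256(byte[] data) {
        try {
            return MessageDigest.getInstance("SHA-256").digest(data);
        } catch (NoSuchAlgorithmException e) {
            // Every compliant Java platform is required to provide SHA-256.
            throw new IllegalStateException(e);
        }
    }

    // True when both files have the same digest, i.e. byte-for-byte equal content.
    public static boolean identical(Path a, Path b) throws IOException {
        return MessageDigest.isEqual(
                sha256(Files.readAllBytes(a)),
                sha256(Files.readAllBytes(b)));
    }

    // Percentage of positions holding the same byte, relative to the longer input.
    public static double byteMatchPercent(byte[] a, byte[] b) {
        int longest = Math.max(a.length, b.length);
        if (longest == 0) {
            return 100.0; // two empty inputs are treated as identical
        }
        int shortest = Math.min(a.length, b.length);
        int same = 0;
        for (int i = 0; i < shortest; i++) {
            if (a[i] == b[i]) {
                same++;
            }
        }
        return 100.0 * same / longest;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.out.println("usage: java FileCompare <fileA> <fileB>");
            return;
        }
        Path a = Paths.get(args[0]);
        Path b = Paths.get(args[1]);
        System.out.println("identical: " + identical(a, b));
        System.out.println("byte match %: "
                + byteMatchPercent(Files.readAllBytes(a), Files.readAllBytes(b)));
    }
}
```

Note that the hash check only answers "identical or not", and that the positional score drops sharply as soon as a single byte is inserted or removed, because everything after it shifts. Two PDFs that look the same can also differ at the byte level (metadata, for example), so neither check replaces extracting and comparing the text.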

0 Answers