I have a Java application where I am extracting the text content from a PDF file using PDFBox.

The problem is that there must be no duplicates among the PDFs I store on the server, and if a user tries to upload a PDF that already exists I should show them a warning.

I was thinking of generating a UID based on the content of the PDF, so that two identical PDFs would get the same UID, but I don't know if this is possible. I have also heard about Lucene, but I don't really understand whether it is suitable for what I want.

Is there an approach that can accomplish this?

  • What you are describing is a hash. Check out SHA-256. – le3th4x0rbot Mar 11 '17 at 17:42
  • A *hash function* like md5() does exactly what you're looking for: it takes an arbitrary amount of data and outputs a short value that is, for practical purposes, unique to the input. – Alex K. Mar 11 '17 at 17:43
  • @AlexK. don't use MD5 or SHA1, collisions are too easy. The answer by "Bailey S" is fine. – Tilman Hausherr Mar 11 '17 at 17:48
  • Possible duplicate of [How to calculate hash value of a file in Java?](http://stackoverflow.com/questions/32032851/how-to-calculate-hash-value-of-a-file-in-java) – Tilman Hausherr Mar 11 '17 at 17:50
  • @TilmanHausherr the risk of a collision is vanishingly small & would not be worth thinking about. This is not a security context where an attacker would contrive collisions. – Alex K. Mar 11 '17 at 18:08
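
To make the hashing suggestion from the comments concrete, here is a minimal sketch that computes a SHA-256 fingerprint of an uploaded file with the JDK's `MessageDigest` and checks it against fingerprints seen so far. The class name `PdfFingerprint`, the method `registerIfNew`, and the in-memory `Set` are assumptions made for illustration and are not from the question; a real application would probably store the hash in a database column with a unique constraint. Note that this hashes the raw bytes of the file, so two PDFs with the same visible text but different metadata or compression would get different fingerprints. If that matters, the same digest could be applied to the text extracted with PDFBox instead.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper class; the names are illustrative only.
final class PdfFingerprint {

    // Stand-in for persistent storage of the fingerprints already on the server.
    private final Set<String> knownHashes = new HashSet<>();

    private static MessageDigest newSha256() {
        try {
            return MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    private static String toHex(byte[] bytes) {
        StringBuilder hex = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // Fingerprint of content that is already in memory.
    static String sha256Hex(byte[] data) {
        return toHex(newSha256().digest(data));
    }

    // Fingerprint of a file, read in chunks so large PDFs are not loaded whole.
    static String sha256Hex(Path file) throws IOException {
        MessageDigest digest = newSha256();
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                digest.update(buffer, 0, read);
            }
        }
        return toHex(digest.digest());
    }

    // Returns true if the hash was not seen before, false if it is a duplicate.
    boolean registerIfNew(String hash) {
        return knownHashes.add(hash);
    }
}
```

On upload, one would compute `PdfFingerprint.sha256Hex(path)` for the incoming file and warn the user when `registerIfNew` returns `false`.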

0 Answers