I have a Java application where I am extracting the text content from a PDF file using PDFBox.
The problem is that I must not have duplicates among the PDFs I store on the server, and if a user tries to upload a PDF that already exists, I should send them a warning.
I was thinking of generating a UID based on the content of the PDF, so that two identical PDFs would get the same UID, but I don't know if this is possible. I have also heard about Lucene, but I don't really understand whether it is suitable for what I want.
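To make the idea concrete, this is roughly what I have in mind: a minimal sketch, assuming the text has already been extracted with PDFBox and is passed in as a plain `String`. The class name `ContentFingerprint`, the `normalize` helper, and the choice of SHA-256 are my own assumptions, not anything PDFBox provides, and I am not sure that collapsing whitespace is the right notion of "identical" content.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class ContentFingerprint {

    // Hypothetical helper: collapse runs of whitespace so that layout-only
    // differences in the extracted text do not change the fingerprint.
    static String normalize(String text) {
        return text.trim().replaceAll("\\s+", " ");
    }

    // Returns a 64-character hex SHA-256 digest of the normalized text.
    // Equal normalized text always yields the same digest.
    static String fingerprint(String extractedText) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(
                    normalize(extractedText).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    public static void main(String[] args) {
        // The two strings differ only in whitespace.
        String first = fingerprint("Hello   PDF\nworld");
        String second = fingerprint("Hello PDF world");
        System.out.println(first.equals(second)); // prints true
    }
}
```

The fingerprint would be stored next to each PDF, and an upload would trigger the warning when its fingerprint is already present. This only catches PDFs whose extracted text matches after normalization; it would not detect near-duplicates.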
Is there any approach which can accomplish this?