I have PDFs stored in Hadoop HDFS as unstructured data. I want to determine whether two PDFs are similar, how similar they are, and which parts they share and which parts differ.
I am new to this, so example code with an explanation of how it works would be very helpful.
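
To make the goal concrete, here is a rough sketch of the kind of comparison I mean, not a working solution. It assumes the text of each PDF has already been extracted into a plain Python string (reading the files from HDFS and extracting text from the PDFs are not shown), and the function names `tokenize`, `cosine_similarity`, and `vocabulary_diff` are only illustrative. Similarity is scored as the cosine of the two word-count vectors, and the dissimilarity is reported as the words that appear in only one of the documents.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine similarity of the two texts' word-count vectors (0.0 to 1.0)."""
    counts_a = Counter(tokenize(text_a))
    counts_b = Counter(tokenize(text_b))
    # Only words present in both texts contribute to the dot product.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        # An empty document has no direction, so treat it as dissimilar.
        return 0.0
    return dot / (norm_a * norm_b)


def vocabulary_diff(text_a, text_b):
    """Return (shared words, words only in A, words only in B)."""
    words_a = set(tokenize(text_a))
    words_b = set(tokenize(text_b))
    return words_a & words_b, words_a - words_b, words_b - words_a


if __name__ == "__main__":
    # Placeholder strings standing in for text extracted from two PDFs.
    doc_a = "Hadoop stores PDF files"
    doc_b = "Hadoop stores text files"
    print("similarity:", cosine_similarity(doc_a, doc_b))
    shared, only_a, only_b = vocabulary_diff(doc_a, doc_b)
    print("shared:", sorted(shared))
    print("only in A:", sorted(only_a))
    print("only in B:", sorted(only_b))
```

A score near 1.0 would mean the two documents use nearly the same words in the same proportions, and a score of 0.0 would mean they share no words. I do not know whether plain word counts are good enough for my PDFs or how this should be run across many files in HDFS, which is the part I need help with.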