I have PDFs stored in Hadoop HDFS as unstructured data. I want to determine whether two PDFs are similar, and to measure how similar or dissimilar they are.

I am new to this, so code with an explanation of how it works would be very helpful.

emshore
Avinav Mishra
  • Removed ML tag, which is not for machine learning. – Andreas Rossberg Jul 20 '18 at 07:14
  • Welcome to StackOverflow. Please read and follow the posting guidelines in the help documentation, as suggested when you created this account. [On topic](http://stackoverflow.com/help/on-topic) and [how to ask](http://stackoverflow.com/help/how-to-ask) apply here. StackOverflow is not a design, coding, research, or tutorial service. – Prune Jul 20 '18 at 16:56

1 Answer

If those PDF files are pure text, you can first extract the text with one of the tools listed in "How to extract text from a PDF?" and then compute a locality-sensitive hash (LSH), for instance simhash, of each file's text.

The distance between the two files' hashes (the Hamming distance, in the case of simhash) can then be used as their dissimilarity.
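
As a minimal sketch of the idea, assuming the text has already been extracted into Python strings: the code below computes a 64-bit simhash over word tokens and compares two fingerprints by Hamming distance. The function names `simhash` and `hamming_distance`, the word-level tokenization, and the use of MD5 as the per-token hash are illustrative choices, not something prescribed above.

```python
import hashlib
import re

BITS = 64  # fingerprint width; an assumption for this sketch


def simhash(text: str) -> int:
    """Return a 64-bit simhash fingerprint of the words in `text`."""
    weights = [0] * BITS
    for token in re.findall(r"\w+", text.lower()):
        # Map each token to a stable 64-bit integer (MD5 is used only for mixing).
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(BITS):
            weights[i] += 1 if (h >> i) & 1 else -1
    # A bit of the fingerprint is set where the positive votes won.
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

With this sketch, `hamming_distance(simhash(text_a), simhash(text_b))` gives a value between 0 and 64; a smaller value means the two texts are more similar, and dividing by 64 turns it into a dissimilarity between 0 and 1.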

shaun shia
  • The PDFs contain both images and text, and they also have to be extracted from HDFS. – Avinav Mishra Jul 20 '18 at 09:57
  • I don't think there is an out-of-the-box solution for you. You need custom code to: 1. Read the PDF file from Hadoop, for example `hadoop fs -get /data/a.pdf ./` 2. Extract the text and images from the PDF. 3. Calculate the LSH of the extracted content. – shaun shia Jul 20 '18 at 11:14
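
The first two steps of that comment could be sketched roughly as follows. This assumes the `hadoop` CLI and Poppler's `pdftotext` are installed and on the `PATH`; `pdftotext` is just one possible extraction tool, and the helper names are hypothetical. It uses `hadoop fs -get` to copy the file to local disk. It handles text only; images are not covered.

```python
import os
import posixpath
import subprocess


def hdfs_get_command(hdfs_path: str, local_dir: str) -> list:
    """Build the command that copies a file from HDFS to a local directory."""
    return ["hadoop", "fs", "-get", hdfs_path, local_dir]


def pdftotext_command(pdf_path: str) -> list:
    """Build the pdftotext command; "-" sends the extracted text to stdout."""
    return ["pdftotext", pdf_path, "-"]


def local_copy_path(hdfs_path: str, local_dir: str) -> str:
    """Return where the fetched file ends up on local disk."""
    return os.path.join(local_dir, posixpath.basename(hdfs_path))


def extract_text_from_hdfs_pdf(hdfs_path: str, local_dir: str = ".") -> str:
    """Fetch a PDF from HDFS and return its text (requires both CLIs)."""
    subprocess.run(hdfs_get_command(hdfs_path, local_dir), check=True)
    result = subprocess.run(
        pdftotext_command(local_copy_path(hdfs_path, local_dir)),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

The returned string can then be fed to whichever LSH function is used for step 3.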