How to find similarity score between two PDFs stored in HDFS

Question

I have PDFs stored in Hadoop HDFS as unstructured data. I want to determine whether two PDFs are similar, and to measure how similar or dissimilar they are.

I am new to this, so it would be very helpful if you could provide code and explain how it works.

Welcome to StackOverflow. Please read and follow the posting guidelines in the help documentation, as suggested when you created this account. [On topic](http://stackoverflow.com/help/on-topic) and [how to ask](http://stackoverflow.com/help/how-to-ask) apply here. StackOverflow is not a design, coding, research, or tutorial service. — Prune, Jul 20 '18 at 16:56

score 0 · Answer 1 · answered Jul 20 '18 at 07:24

If the PDF files are pure text, you can first extract the text with one of the tools listed in "How to extract text from a PDF?" and then compute a locality-sensitive hash (LSH), for instance simhash, of the extracted text.

The distance between the two files' hashes can then be used as their dissimilarity. For simhash, that distance is the Hamming distance between the two fingerprints.
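To make the idea concrete, here is a minimal sketch of simhash over already-extracted text, using only the Python standard library. The function names (`tokenize`, `simhash`, `hamming_distance`, `similarity`), the 64-bit fingerprint size, the word-level tokenization, and the use of SHA-256 as the per-token hash are all assumptions made for illustration; a real setup would probably weight tokens or use shingles.

```python
import hashlib
import re


def tokenize(text):
    """Split text into lowercase word tokens (a simple, assumed tokenizer)."""
    return re.findall(r"\w+", text.lower())


def simhash(text, bits=64):
    """Return a `bits`-wide simhash fingerprint of `text` as an int."""
    weights = [0] * bits
    for token in tokenize(text):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[: bits // 8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(bits):
            weights[i] += 1 if (token_hash >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def similarity(a, b, bits=64):
    """Map Hamming distance to a score in [0, 1]; 1.0 means identical fingerprints."""
    return 1.0 - hamming_distance(a, b) / bits


if __name__ == "__main__":
    text_a = "Quarterly report on cluster storage usage and growth"
    text_b = "Quarterly report on cluster storage usage and trends"
    fp_a, fp_b = simhash(text_a), simhash(text_b)
    print("hamming distance:", hamming_distance(fp_a, fp_b))
    print("similarity:", similarity(fp_a, fp_b))
```

The Hamming distance serves as the dissimilarity, and `similarity` is just one possible way to turn it into a score. What counts as "similar" is a threshold you would have to tune on your own documents.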

shaun shia

The PDFs contain both images and text, and they also have to be extracted from HDFS. – Avinav Mishra Jul 20 '18 at 09:57

I don't think there is an out-of-the-box solution for this. You need custom code to: 1. Read the PDF file from Hadoop, for example `hadoop fs -get /data/a.pdf ./`. 2. Extract the text and images from the PDF. 3. Compute an LSH of the extracted content. – shaun shia Jul 20 '18 at 11:14
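The three steps in this comment could be wired together roughly as follows. This is only a sketch under several assumptions: the `hadoop` CLI is on the `PATH`, the file is streamed with `hadoop fs -cat` rather than copied to local disk, the third-party `pypdf` package is installed (its image extraction may additionally need Pillow), and all function names are hypothetical. For images, the sketch swaps in a simpler technique than LSH: it compares exact SHA-256 digests, so it only detects byte-identical embedded images. A perceptual hash would be closer to the LSH idea.

```python
import hashlib
import io
import subprocess


def hdfs_cat_command(hdfs_path):
    """Build the CLI command that streams an HDFS file to stdout."""
    return ["hadoop", "fs", "-cat", hdfs_path]


def read_hdfs_file(hdfs_path):
    """Step 1: read the raw PDF bytes from HDFS (requires the hadoop CLI)."""
    result = subprocess.run(
        hdfs_cat_command(hdfs_path), check=True, capture_output=True
    )
    return result.stdout


def extract_text_and_images(pdf_bytes):
    """Step 2: return (text, list of image bytes) for a PDF held in memory."""
    from pypdf import PdfReader  # third-party; assumed to be installed

    reader = PdfReader(io.BytesIO(pdf_bytes))
    texts, images = [], []
    for page in reader.pages:
        texts.append(page.extract_text() or "")
        images.extend(image.data for image in page.images)
    return "\n".join(texts), images


def image_jaccard(images_a, images_b):
    """Step 3 (images only): Jaccard overlap of exact image digests."""
    digests_a = {hashlib.sha256(data).hexdigest() for data in images_a}
    digests_b = {hashlib.sha256(data).hexdigest() for data in images_b}
    if not digests_a and not digests_b:
        return 1.0  # neither PDF has images, so treat them as equal here
    return len(digests_a & digests_b) / len(digests_a | digests_b)


if __name__ == "__main__":
    # Demonstrates only the image comparison; the HDFS and PDF steps
    # need a cluster and real files.
    print(image_jaccard([b"logo", b"chart"], [b"chart", b"photo"]))
```

The extracted text from step 2 can be fed to whatever text hash you choose, such as simhash, and the text and image scores can then be reported separately or combined with weights that suit your data.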
