I have a table containing multiple records whose texts may be different, identical, or partially similar.
For example:
record 1: Stack overflow forum is very useful. This helps developers and researchers a most.
record 2: There are several very useful forums available that helps developers and researchers.
record 3: This stack overflow forum is very useful. This helps developers and researchers a most.
record 4: This text should not be considered.
Consider records 1 and 3: both are the same, and they are already marked as duplicates because I generate a hash code for each record.
Record 4 contains completely different text.
Now look at records 1 and 2: they have mostly the same meaning and contain many of the same words.
When the two are compared, the percentage of words they share is higher than for unrelated records such as record 4.
So I need to extract records like these based on that percentage.
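To make "percentage of similar words" concrete, here is a rough sketch of what I have in mind. It assumes the Jaccard index over word sets (distinct shared words divided by all distinct words in either text), which is only one possible definition. The class name WordOverlap and its method names are my own placeholders, not taken from any library.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class WordOverlap {

    // Lower-case the text, split on anything that is not a letter or digit,
    // and collect the distinct words.
    static Set<String> words(String text) {
        Set<String> result = new HashSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    // Jaccard index as a percentage: 100 * |A intersect B| / |A union B|.
    // Two texts with no words at all are treated as identical (assumption).
    static double similarityPercent(String a, String b) {
        Set<String> wordsA = words(a);
        Set<String> wordsB = words(b);
        if (wordsA.isEmpty() && wordsB.isEmpty()) {
            return 100.0;
        }
        Set<String> shared = new HashSet<>(wordsA);
        shared.retainAll(wordsB);
        Set<String> union = new HashSet<>(wordsA);
        union.addAll(wordsB);
        return 100.0 * shared.size() / union.size();
    }
}
```

With this definition, records 1 and 3 score 100%, records 1 and 2 score about 32% (6 shared words out of 19 distinct words, since "forum" and "forums" count as different words), and records 1 and 4 score about 6%. Records would then be extracted when the score exceeds some threshold that I would still have to choose.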
Is there an algorithm, usable from Java, that can do this?
Any guidance would be appreciated.