How to identify records containing similar text in an MSSQL table

Question

I have a table containing multiple records whose texts are different, similar, or partially similar.

For example:

record 1: Stack overflow forum is very useful. This helps developers and researchers a most.

record 2: There are several very useful forums available that helps developers and researchers.

record 3: This stack overflow forum is very useful. This helps developers and researchers a most.

record 4: This text should not be considered.

Consider record 1 and record 3: both are the same, and they are marked as duplicates because I generate a hash code for each record.

Record 4 contains totally different text.

Now look at record 1 and record 2: they have nearly the same meaning and contain many of the same words.

When the two records are compared, the percentage of words they share is high.

So I need to extract records like these based on that percentage.

Is there an algorithm, usable from Java, that does this?

Any guidance would be useful.

Your actual question is: I need a Java algorithm to calculate string similarity. The other 90% of the question text is irrelevant. And I suggest you google for that first, because asking for resources is off-topic here. — Jan Doggen, Mar 20 '15 at 09:20

score 0 · Answer 1 · edited May 23 '17 at 11:56


You can use fuzzy string search for this requirement. Maybe this post will help you out. For searching in the database, you can also use Hibernate Search. See Hibernate Querying.
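As a rough illustration of the "percentage of similar words" idea from the question, here is a minimal Java sketch that compares two texts by the words they share (Jaccard similarity on word sets). This is only one simple form of fuzzy matching, not what Hibernate Search does internally. The class name `TextSimilarity`, the method names, and the tokenization rule are assumptions made for this example.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper class, named for this example only.
class TextSimilarity {

    // Split text into a set of lower-cased words, ignoring punctuation.
    static Set<String> words(String text) {
        Set<String> result = new HashSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    // Jaccard similarity as a percentage:
    // (number of shared words) / (number of distinct words in both texts).
    static double wordSimilarity(String a, String b) {
        Set<String> wordsA = words(a);
        Set<String> wordsB = words(b);
        if (wordsA.isEmpty() && wordsB.isEmpty()) {
            return 100.0; // two empty texts are treated as identical
        }
        Set<String> shared = new HashSet<>(wordsA);
        shared.retainAll(wordsB);
        Set<String> all = new HashSet<>(wordsA);
        all.addAll(wordsB);
        return 100.0 * shared.size() / all.size();
    }
}
```

With the sample records from the question, `TextSimilarity.wordSimilarity(record1, record3)` gives 100 because both contain the same set of words, and record 1 scores noticeably higher against record 2 than against record 4. Which threshold counts as "similar" is a judgment call for your data. Note that this compares exact words only ("forum" and "forums" do not match) and that comparing every pair of rows grows quadratically with the table size.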


answered Mar 20 '15 at 08:42

Prabhat

