
I need to test some texts to check for duplicate content, for SEO purposes.

For this I have 2 texts (in 2 strings, S1 and S2) and I need to determine the percentage of similarity between the two strings. My first code works; it determines the percentage with

(number of words common to S1 and S2) / (number of words in the shorter of S1 and S2) × 100.
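For reference, that word-overlap measure can be sketched like this (Java shown here purely as an illustration; the same logic translates directly to Delphi):

```java
import java.util.*;

public class WordOverlap {
    // Percentage of the shorter string's distinct words that
    // also appear in the other string.
    static double similarity(String s1, String s2) {
        Set<String> w1 = new HashSet<>(Arrays.asList(s1.toLowerCase().split("\\s+")));
        Set<String> w2 = new HashSet<>(Arrays.asList(s2.toLowerCase().split("\\s+")));
        Set<String> common = new HashSet<>(w1);
        common.retainAll(w2);
        int shorter = Math.min(w1.size(), w2.size());
        return shorter == 0 ? 0.0 : 100.0 * common.size() / shorter;
    }

    public static void main(String[] args) {
        // "the" and "brown" are shared: 2 of 4 words -> 50.0
        System.out.println(similarity("the quick brown fox", "the slow brown dog"));
    }
}
```

Note that this treats the texts as bags of words, so word order and repetition are ignored entirely.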

But I am not sure this is a good algorithm.

Do you have any experience, or a code example, to share?

philnext
  • Can you explain a bit more? Do you need to compare the content of URL_A and URL_B and find out if they are duplicates? Duplicate in the sense that the content of URL_A is **exactly** the same as the content of URL_B? – Martin Magakian Jul 23 '13 at 09:18
  • You should consider rewording your question; asking for a library recommendation would be off topic on SO. – bummi Jul 23 '13 at 09:20
  • Question edited to avoid the 'find a library' off-topic. – philnext Jul 24 '13 at 09:11

1 Answer


What you are trying to do is find the percentage of similarity between two strings.

Some existing algorithms already solve this exact problem. I have mainly been using:

  • LevenshteinDistance
  • NGramDistance

I did a quick search for Delphi source code and found a Delphi implementation of Levenshtein.

The Levenshtein algorithm measures how many single-character edits (insertions, deletions, and substitutions) are needed to transform one string into the other.
NGramDistance compares the strings by splitting them into overlapping n-grams.
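A Levenshtein implementation is short enough to sketch directly; here is a standard dynamic-programming version in Java (an illustrative sketch, not Lucene's exact code, and easy to translate to Delphi):

```java
public class Levenshtein {
    // Classic two-row dynamic-programming edit distance:
    // the minimum number of single-character insertions, deletions,
    // and substitutions needed to turn a into b.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Convert the distance into a 0..100 similarity percentage.
    static double similarity(String a, String b) {
        int max = Math.max(a.length(), b.length());
        return max == 0 ? 100.0 : 100.0 * (max - distance(a, b)) / max;
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting")); // 3
    }
}
```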


So with Levenshtein, the string "abc def | klm mno" will be seen as very different from "klm mno | abc def",
but NGramDistance will see them as nearly 100% similar.

So it depends on whether you want to take the order of the words into account.
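To illustrate the n-gram idea, here is a simplified version (a Dice coefficient over character-trigram sets, which is not Lucene's exact NGramDistance formula but shows the order-insensitive behaviour):

```java
import java.util.*;

public class NGramSimilarity {
    // Collect the set of character n-grams of a string.
    static Set<String> ngrams(String s, int n) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + n <= s.length(); i++) grams.add(s.substring(i, i + n));
        return grams;
    }

    // Dice coefficient over trigram sets, as a 0..100 percentage.
    static double similarity(String a, String b) {
        Set<String> ga = ngrams(a, 3), gb = ngrams(b, 3);
        if (ga.isEmpty() && gb.isEmpty()) return 100.0;
        Set<String> common = new HashSet<>(ga);
        common.retainAll(gb);
        return 200.0 * common.size() / (ga.size() + gb.size());
    }

    public static void main(String[] args) {
        String s1 = "abc def | klm mno";
        String s2 = "klm mno | abc def";
        // About 73 here: high despite the reordering, because most
        // trigrams survive; only the ones spanning the boundary change.
        System.out.println(similarity(s1, s2));
    }
}
```

An edit-distance-based similarity would rate these two strings much lower, since almost every character position differs.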


I couldn't find any Delphi source code for NGramDistance, but you can translate it from Java to Delphi.

The Java source code comes from Lucene, an open-source search library. They implemented many more string-metric algorithms; check out this package.

Martin Magakian