1

I use google-diff-match-patch C# library. I want to measure the similarity between two texts. To do this I make this C# code :

List<DiffMatchPatch.Diff> lDiffs = dmpDiff.diff_main(sTexte1, sTexte2);
int iIndex = dmpDiff.diff_levenshtein(lDiffs);
double dsimilarity = 100 - ((double)iIndex / Math.Max(sTexte1.Length, sTexte2.Length) * 100);

With similarity values between 0 - 100 (0 == perfect match - 100 == totaly different).

Do you think this is a good approach, that this calculation is correct?

LeMoussel
  • 5,290
  • 12
  • 69
  • 122

1 Answers1

1

I've had a look at diff_levenshtein on the API home page and it gives this description

Given a diff, measure its Levenshtein distance in terms of the number of inserted, deleted or substituted characters. The minimum distance is 0 which means equality, the maximum distance is the length of the longer string.

In the following line, all you are turning the distance (the change measurement) into a percentage of the original string length, and then substracting it from one hundred.

double dsimilarity = 100 - ((double)iIndex / Math.Max(sTexte1.Length, sTexte2.Length) * 100);

So, yes, this looks fine to me.

My only comment would be that the original algorithm uses 0 to represent a perfect match and you are using 100, which might be confusing. If you are ok with this, make your you comment it appropriately for future maintainers.

James Wiseman
  • 29,946
  • 17
  • 95
  • 158
  • I'm OK. I make comment appropriately. – LeMoussel Sep 30 '13 at 09:15
  • James, minors pb. with words "considéré" & "apprécié", `iIndex = 11` then `dsimilarity = -22.22`. Should take the absolute value? – LeMoussel Oct 01 '13 at 06:40
  • If you're getting a `dsimilarity` of -22.22, then you have a `Math.Max(sTexte1.Length, sTexte2.Length)` of `9`. According to the docs this is not possible (you should not get a distance equal or less than the string length). I think the problem is with accented characters. There is a MSQL question still open with a similar problem. You may want to consider offereing a bounty for that or asking your own. http://stackoverflow.com/questions/10426732/levenshtein-algorithm-in-mysql-and-accented-characters – James Wiseman Oct 01 '13 at 08:35
  • 1
    I posted a response in google-diff-match-patch google group (https://groups.google.com/forum/#!topic/diff-match-patch/RXxj5-fxMKs) There is issue #87 about this (https://code.google.com/p/google-diff-match-patch/issues/detail?id=87) Strange, very strange .... – LeMoussel Oct 01 '13 at 10:49