6

I have two text files which I'd like to compare. What I did is:

  1. I've split both of them into sentences.
  2. I've measured levenshtein distance between each of the sentences from one file with each of the sentences from second file.

I'd like to calculate average similarity between those two text files, however I have trouble to deliver any meaningful value - obviously arithmetic mean (sum of all the distances [normalized] divided by number of comparisions) is a bad idea.

How to interpret such results?

edit: Distance values are normalized.

user2207055
  • 91
  • 1
  • 4
  • 1
    You could normalize the distances, `d(A,B) / max(length(A), length(B))`, then report the arithmetic mean. – Fred Foo Mar 25 '13 at 10:30
  • @larsmans, distances are already normalized. – user2207055 Mar 25 '13 at 10:44
  • 1
    Then why is the mean a bad idea? – Fred Foo Mar 25 '13 at 13:39
  • @larsmans due to high volume of data: consider i've found 1 perfect match and 99 matches with distance > 0.8. Mean won't be noticeably different from 100 matches > 0.8. – user2207055 Mar 25 '13 at 14:15
  • @user2207055 Maybe it would help if you said what kind of text you're comparing and what you do with the results. - I'd be worried about sentences with small edit distances with completely different meanings. (`"This is a complete proof."` vs `"This is completely foolish."`) – us2012 Mar 25 '13 at 18:19

1 Answers1

16

The levenshtein distances has a maximum value, i.e. the max. length of both input strings. It cannot get worse than that. So a normalized similarity index (0=bad, 1=match) for two strings a and b can be calculated as 1- distance(a,b)/max(a.length, b.length).

Take one sentence from File A. You said you'd compare this to each sentence of File B. I guess you are looking for a sentence out of B which has the smallest distance (i.e. the highest similarity index).

Simply calculate the average of all those 'minimum similarity indexes'. This should give you a rough estimation of the similarity of two texts.

But what makes you think that two texts which are similar might have their sentences shuffled? My personal opinion is that you should also introduce stop word lists, synonyms and all that.

Nevertheless: Please also check trigram matching which might be another good approach to what you are looking for.

alzaimar
  • 4,572
  • 1
  • 16
  • 30
  • It also has a minimum value that is often different from 0 and is abs(a.length - b.length). So proper normalization would be (distance(a,b)-minval) / (maxval-minval) – blues Mar 23 '17 at 20:24
  • Are you sure? Check "x" vs. "The quick frown box". Your definition yields 0 (d=19, minval=18). These to strings are definitely not equal. – alzaimar Mar 25 '17 at 08:33