Algorithm for measuring distance between disordered sequences

Question

The Levenshtein distance gives us a way to calculate the distance between two similar strings in terms of disordered individual characters:

quick brown fox
quikc brown fax

The Levenshtein distance = 3.
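
For reference, the figure above can be reproduced with a minimal sketch of the standard dynamic-programming edit distance. The function name `levenshtein` is only an illustrative choice, and the code counts insertions, deletions and substitutions at unit cost (no transposition operation).

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete and substitute."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            substitution_cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,                      # delete ca
                current[j - 1] + 1,                   # insert cb
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


print(levenshtein("quick brown fox", "quikc brown fax"))  # 3
```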

What is a similar algorithm for the distance between two strings with similar subsequences? For example, in

quickbrownfox
brownquickfox

the Levenshtein distance is 10, but this takes no account of the fact that the strings have two similar subsequences, which makes them more "similar" than completely disordered words like

quickbrownfox
qburiocwknfox

and yet this completely disordered version has a Levenshtein distance of only 8.

What distance measures exist which take the length of subsequences into account, without assuming that the subsequences can be easily broken into distinct words?

How is this off-topic? Maybe one could just improve the title. — Dario, May 18 '10 at 11:26
This has been asked many times under a better name :o) http://stackoverflow.com/questions/451884/similar-string-algorithm or http://stackoverflow.com/questions/653157/a-better-similarity-ranking-algorithm-for-variable-length-strings or http://stackoverflow.com/questions/246961/algorithm-to-find-similar-text Btw: I especially like the idea of a compression-based distance. — MaR, May 18 '10 at 11:27
@MaR: those questions are not the same as this question. The point is that there is no obvious way to break the string into words. — , May 18 '10 at 11:30
Also, here is an interesting page comparing different string similarity metrics: http://www.dcs.shef.ac.uk/~sam/stringmetrics.html The SmithWatermanGotoh metric seems to come out best in this comparison. — MaR, May 18 '10 at 11:32

score 1 · Answer 1 · answered May 18 '10 at 14:12

I think you could try shingles (overlapping character n-grams), either on their own or in some combination with Levenshtein distance.
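
As a rough illustration of the shingle idea, here is a minimal sketch that splits each string into overlapping character k-grams and uses the Jaccard distance between the two sets. The names `shingles` and `shingle_distance`, the choice of k = 3, and the use of Jaccard are all assumptions made for the example; other set-overlap measures would work the same way.

```python
def shingles(text: str, k: int = 3) -> set:
    """Return the set of overlapping character k-grams of text."""
    if len(text) < k:
        # Too short for a full shingle: fall back to the whole string.
        return {text} if text else set()
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def shingle_distance(a: str, b: str, k: int = 3) -> float:
    """Jaccard distance between shingle sets: 0.0 = same set, 1.0 = disjoint."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)


original = "quickbrownfox"
print(shingle_distance(original, "brownquickfox"))  # blocks swapped
print(shingle_distance(original, "qburiocwknfox"))  # characters interleaved
```

With these inputs the block-swapped string comes out closer to the original than the interleaved one, which is the ordering the question asks for.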

Manvel

mathmike · Answer 2 · 2019-09-29T14:59:55.817

1

One simple metric would be to take all n*(n+1)/2 contiguous substrings of each string and count how many the two strings have in common. There are some simple variations of this approach where you only look at substrings up to a certain length.

This would be similar to the BLEU score commonly used to evaluate machine translations. BLEU compares two sentences: it takes all the unigrams, bigrams, trigrams, and 4-grams of words from each sentence, calculates a version of precision and recall for each n-gram size, and essentially averages those scores.
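
A minimal sketch of the substring-overlap idea, limited to substrings of length 1 to 4, might look like the following. It is not BLEU itself: it works on characters rather than words, and for each length it uses the Dice coefficient of the clipped n-gram counts (which combines precision and recall) and then takes a plain arithmetic mean. The names `ngram_counts` and `ngram_overlap_score` and the default `max_n=4` are assumptions for the example.

```python
from collections import Counter


def ngram_counts(text: str, n: int) -> Counter:
    """Count every contiguous substring of length n in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def ngram_overlap_score(a: str, b: str, max_n: int = 4) -> float:
    """Average overlap for n = 1..max_n: 1.0 = identical counts, 0.0 = no overlap."""
    scores = []
    for n in range(1, max_n + 1):
        counts_a, counts_b = ngram_counts(a, n), ngram_counts(b, n)
        total = sum(counts_a.values()) + sum(counts_b.values())
        if total == 0:
            continue  # both strings are shorter than n
        # Counter & Counter keeps the minimum count per n-gram (clipped matches).
        shared = sum((counts_a & counts_b).values())
        scores.append(2 * shared / total)
    return sum(scores) / len(scores) if scores else 0.0


original = "quickbrownfox"
print(ngram_overlap_score(original, "brownquickfox"))  # blocks swapped
print(ngram_overlap_score(original, "qburiocwknfox"))  # characters interleaved
```

Because longer n-grams survive a block swap but not a character-level shuffle, the first score is higher than the second. Use `1 - score` if a distance is needed rather than a similarity.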

edited Sep 29 '19 at 14:59

answered May 19 '10 at 06:20

mathmike

The link doesn't work anymore, but the answer is spot on. – ben26941 May 03 '19 at 10:20

score 0 · Answer 3 · answered May 18 '10 at 11:29

Initial stab: use a diff algorithm and take the number of differences as your distance.
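
One way to try this quickly is Python's standard `difflib` module. The sketch below counts the number of non-matching hunks (insert, delete or replace opcodes) reported by `SequenceMatcher`. Counting hunks rather than changed characters is an assumption about what "number of differences" should mean here, and `diff_distance` is just an illustrative name.

```python
from difflib import SequenceMatcher


def diff_distance(a: str, b: str) -> int:
    """Number of changed hunks in a character-level diff of a and b."""
    # autojunk=False disables difflib's heuristic for long, repetitive inputs.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    # Each opcode is (tag, i1, i2, j1, j2); 'equal' marks an unchanged run.
    return sum(1 for tag, *_ in matcher.get_opcodes() if tag != "equal")


original = "quickbrownfox"
print(diff_distance(original, "brownquickfox"))  # blocks swapped
print(diff_distance(original, "qburiocwknfox"))  # characters interleaved
```

A moved block shows up as a small number of large hunks, while an interleaved string produces many small ones, so the hunk count ranks the block-swapped string as closer.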

jk.

score 0 · Answer 4 · answered May 18 '10 at 14:57

I have the impression that this is an NP-complete problem.

At least, I cannot see how we can avoid an exhaustive search. Moreover, I cannot even see how we could verify a given solution in polynomial time.

Roman

score 0 · Answer 5 · answered May 19 '10 at 06:29

Well, the problem you're referring to falls under context-sensitive grammars. You basically define a grammar (the English grammar in this case) and then find the distance between the grammar and a mismatch. You'll need to parse your input first.

Laz

It's not the English grammar. These are not English words. – May 19 '10 at 08:44
