Find element similarity within a collection of strings without evaluating all element pairs

Question

So the problem collection is something like:

A = {'abc', 'abc', 'abd', 'bcde', 'acbdg', ...}

Using some type of string metric like Levenshtein distance, it's simple enough to find some sort of heuristic of string similarity between 2 strings.

However, I would like to determine, without evaluating all pairs of strings in the collection (an O(N^2) problem), some type of heuristic based on the entire collection that gives me a good idea of the overall similarity between all the strings.

The brute force approach is:

                          Sum(Metric(All Pairs in A))
CollectionSimilarity(A) = ---------------------------
                                 N*(N+1)/2

Is there a way to evaluate the similarity of the entire collection of A without evaluating every pair?

Is edit distance your definition of "similarity"? Like, what is your Platonic definition of similarity? — JasonN, Jan 12 '15 at 21:15
Something like 1/(Levenshtein distance), but the metric doesn't matter too much. — Dennis, Jan 12 '15 at 21:22
Defining the best heuristic would depend greatly on what you are trying to do in the end, but some possibilities would be the average of the distances between adjacent elements (`O(N)` instead of `O(N^2)`), or maybe the average of distances between one representative and all the other elements, or the average across a randomly-generated subset of all the pairs... There are plenty of others, but your actual problem domain would influence which of them is "best"... — twalberg, Jan 12 '15 at 21:32
Excellent point. For this problem, let's assume one representative is not a feasible solution (which would basically solve this problem). Adjacent elements is not feasible either in the case of A = {a,b,a,b,a,b,a,...}. Randomly-generated, or cleverly designed patterns like the one I just posited would be the ideal problem domain. — Dennis, Jan 13 '15 at 22:14

score 0 · Answer 1 · answered Jan 13 '15 at 15:56

0

You can always use some approximation (eg. sampling pairs). Depending on how large N is, this value should converge with NlogN samples.

answered Jan 13 '15 at 15:56

ElKamina

7,747
28
43

Are you referencing the Central Limit Theorem? Perhaps you can expand your answer? – Dennis Jan 14 '15 at 19:51
I am actually not very sure. See http://en.wikipedia.org/wiki/Sample_size_determination – ElKamina Jan 15 '15 at 03:55

score 0 · Answer 2 · answered Jan 13 '15 at 17:05

Since every string is a vector in some metric space (where every char is particular coordinate), my solution is to find the distance between the set A and some point P.

Let's look at one metric's property - the triangle inequality:

Distance(x, y) <= Distance(x, *P*) + Distance(y, *P*)

So we can find an upper bound of Sum(Distance(All pairs in A)) as |A| * Sum(Distance(All elements in A to point P):

  Sum(Distance(x, y))      N * Sum(x, *P*)     Sum(x, *P*)
---------------------- <= ----------------- = ------------
     N*(N+1)/2               N*(N+1)/2          (N+1)/2

This point P can be random point or the center of mass (in this case you find an average radius of set) of set A or empty string (zero point) or anything else. Generally speaking P may be any hyperplane. Anyway you'll find some kind of average radius (or diameter) of your set.
Maybe some linear pre-transformation [of set or coordinate system, which is the same] is good. Or iterate multiple times and on every iteration find the distance to new random hyperplane.

Hope this may help!

Find element similarity within a collection of strings without evaluating all element pairs

2 Answers2