2

So the problem collection is something like:

A = {'abc', 'abc', 'abd', 'bcde', 'acbdg', ...}

Using some type of string metric like Levenshtein distance, it's simple enough to find some sort of heuristic of string similarity between 2 strings.

However, I would like to determine, without evaluating all pairs of strings in the collection (an O(N^2) problem), some type of heuristic based on the entire collection that gives me a good idea of the overall similarity between all the strings.

The brute force approach is:

                          Sum(Metric(All Pairs in A))
CollectionSimilarity(A) = ---------------------------
                                 N*(N+1)/2

Is there a way to evaluate the similarity of the entire collection of A without evaluating every pair?

Software Engineer
  • 15,457
  • 7
  • 74
  • 102
Dennis
  • 597
  • 4
  • 8
  • 24
  • Is edit distance your definition of "similarity"? Like, what is your Platonic definition of similarity? – JasonN Jan 12 '15 at 21:15
  • Something like 1/(Levenshtein distance), but the metric doesn't matter too much. – Dennis Jan 12 '15 at 21:22
  • Defining the best heuristic would depend greatly on what you are trying to do in the end, but some possibilities would be the average of the distances between adjacent elements (`O(N)` instead of `O(N^2)`), or maybe the average of distances between one representative and all the other elements, or the average across a randomly-generated subset of all the pairs... There are plenty of others, but your actual problem domain would influence which of them is "best"... – twalberg Jan 12 '15 at 21:32
  • Excellent point. For this problem, let's assume one representative is not a feasible solution (which would basically solve this problem). Adjacent elements is not feasible either in the case of A = {a,b,a,b,a,b,a,...}. Randomly-generated, or cleverly designed patterns like the one I just posited would be the ideal problem domain. – Dennis Jan 13 '15 at 22:14

2 Answers2

0

You can always use some approximation (eg. sampling pairs). Depending on how large N is, this value should converge with NlogN samples.

ElKamina
  • 7,747
  • 28
  • 43
0

Since every string is a vector in some metric space (where every char is particular coordinate), my solution is to find the distance between the set A and some point P.

Let's look at one metric's property - the triangle inequality:

Distance(x, y) <= Distance(x, *P*) + Distance(y, *P*)

So we can find an upper bound of Sum(Distance(All pairs in A)) as |A| * Sum(Distance(All elements in A to point P):

  Sum(Distance(x, y))      N * Sum(x, *P*)     Sum(x, *P*)
---------------------- <= ----------------- = ------------
     N*(N+1)/2               N*(N+1)/2          (N+1)/2

This point P can be random point or the center of mass (in this case you find an average radius of set) of set A or empty string (zero point) or anything else. Generally speaking P may be any hyperplane. Anyway you'll find some kind of average radius (or diameter) of your set.
Maybe some linear pre-transformation [of set or coordinate system, which is the same] is good. Or iterate multiple times and on every iteration find the distance to new random hyperplane.

Hope this may help!

pkuderov
  • 3,501
  • 2
  • 28
  • 46