
I have k sets of vectors. The vectors are all the same length m. The sets are not all the same size, but let's say they contain an average of n vectors each. I need to find the group of vectors, one from each set, that has the minimum distance (L2 norm) to each other. This is similar to the "closest pair" problem, but that is for just 2 sets, whereas I have k sets.

The naïve way is to cross-join all the sets and search through all O(n^k) combinations. Is there a better way/algorithm?

Example  
Set A [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  
Set B [[0.5, 0.9], [0.1, 0.3], [0.9, 0.1]]  
Set C [[0.2, 0.2], [0.8, 0.4], [0.5, 0.1]]  
Result - A [0.1, 0.2], B [0.1, 0.3], C [0.2, 0.2] with L2 distance 0.14  
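
For reference, here is what that naive cross join looks like as code: a minimal C++ sketch (my own illustration, not from the question), using the distance the comments below settle on (per dimension, take max minus min over the group, square it, sum the squares, take the square root). On the example it prints A[0] B[1] C[0] with distance ~0.1414:

#include <cmath>
#include <cstdio>

int main()
    {
    const int K=3, N=3, M=2;                    // sets, vectors per set, dimensions
    float set[K][N][M]=
        {
        {{0.1, 0.2}, {0.3, 0.4}, {0.5, 0.6}},   // A
        {{0.5, 0.9}, {0.1, 0.3}, {0.9, 0.1}},   // B
        {{0.2, 0.2}, {0.8, 0.4}, {0.5, 0.1}}    // C
        };
    int ix[K], best_ix[K]={0,0,0};
    float best=-1.0;
    for (ix[0]=0;ix[0]<N;ix[0]++)               // the O(n^k) cross join
     for (ix[1]=0;ix[1]<N;ix[1]++)
      for (ix[2]=0;ix[2]<N;ix[2]++)
        {
        float dd=0.0;
        for (int m=0;m<M;m++)                   // per dimension: (max-min)^2
            {
            float lo=set[0][ix[0]][m],hi=lo;
            for (int k=1;k<K;k++)
                {
                float v=set[k][ix[k]][m];
                if (v<lo) lo=v;
                if (v>hi) hi=v;
                }
            dd+=(hi-lo)*(hi-lo);
            }
        if ((best<0.0f)||(dd<best)){ best=dd; for (int k=0;k<K;k++) best_ix[k]=ix[k]; }
        }
    printf("A[%i] B[%i] C[%i] distance %f\n",best_ix[0],best_ix[1],best_ix[2],std::sqrt(best));
    return 0;
    }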

– fariadantes
    by "the minimum of distance to each other", do you mean the minimal sum of the distances between all pair of vectors from the group? – Cadeyrn Aug 03 '22 at 20:20
  • @Cadeyrn I'm open to suggestions. I was thinking of a total Euclidean distance: for each dimension, take the difference between the minimum and maximum values across the group, square those differences, sum the squares, then take the square root of that sum. But maybe that's not the best approach. – fariadantes Aug 05 '22 at 02:55
  • For that kind of problem, giving values of `k`, `m` and `n` is important. Depending on them, it may be possible to use an exact algorithm, or only a heuristic may be usable. – Olivier Aug 07 '22 at 07:33
  • what is the range of each coordinate in your vectors? The sample suggests a `<0,1>` range – Spektre Aug 09 '22 at 07:14
  • I added working C++ code for my approach to my answer – Spektre Aug 10 '22 at 09:04

3 Answers


I think I have an interesting heuristic for this one.

Put the first element of each of the O(n k) vectors in an array v[1].

Put the second element of each of the O(n k) vectors in an array v[2].

...

Put the m-th element of each of the O(n k) vectors in an array v[m].

Total time complexity for this step: O(m n k).


Put all the pairs of elements of v[1] in an array p[1] and sort it by the squared difference of the pair in O((n k)^2 log(n k)). Start a counter cnt := 0. Iterate through p[1] and look at the second element of each pair. Whenever you see a new element x at position i, save a map map[1][x] := i, put it in an array men[1], increment cnt and put 1 in the cnt-th position of the array women[x]. In the end, men[1] should have k elements.

Put all the pairs of elements of v[2] in an array p[2] and sort it by the squared difference of the pair in O((n k)^2 log(n k)). Start a counter cnt := 0. Iterate through p[2] and look at the second element of each pair. Whenever you see a new element x at position i, save a map map[2][x] := i, put it in an array men[2], increment cnt and put 2 in the cnt-th position of the array women[x]. In the end, men[2] should have k elements.

...

Put all the pairs of elements of v[m] in an array p[m] and sort it by the squared difference of the pair in O((n k)^2 log(n k)). Start a counter cnt := 0. Iterate through p[m] and look at the second element of each pair. Whenever you see a new element x at position i, save a map map[m][x] := i, put it in an array men[m], increment cnt and put m in the cnt-th position of the array women[x]. In the end, men[m] should have k elements.

Total time complexity for this step: O(m (n k)^2 log(n k) ).


Consider a stable marriage problem with m men and k women, where the preferences of a man i are described by the previously calculated men[i] array and the preferences of a woman j are described by the previously calculated women[j] array. Solve it using the Gale-Shapley algorithm and use the previously calculated map to find the pair in the p array to which each match is associated. Put each of these pairs in an array match.

You can use match to get a good vector selection in time linear in its size.

Total time complexity for this step: O(m k).


Final time complexity: O(m (n k)^2 log(n k) )
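
For reference, here is a generic Gale-Shapley sketch in C++ (my own illustration; it is not wired to the men/women/map arrays above, and it assumes equally many men and women for simplicity). pref[i][r] is man i's r-th choice and rank[j][i] is how woman j ranks man i (lower is better):

#include <cstdio>
#include <vector>

// Returns wife[i] for each man i, given full preference lists.
std::vector<int> gale_shapley(const std::vector<std::vector<int> > &pref,
                              const std::vector<std::vector<int> > &rank)
    {
    int n = pref.size();
    std::vector<int> next(n, 0);       // next choice each man will propose to
    std::vector<int> husband(n, -1);   // current match of each woman (-1 = free)
    std::vector<int> freemen;
    for (int i = 0; i < n; i++) freemen.push_back(i);
    while (!freemen.empty())
        {
        int i = freemen.back(); freemen.pop_back();
        int j = pref[i][next[i]++];    // propose to best not-yet-asked woman
        if (husband[j] < 0) husband[j] = i;            // she is free: engaged
        else if (rank[j][i] < rank[j][husband[j]])     // she prefers i: trade up
            { freemen.push_back(husband[j]); husband[j] = i; }
        else freemen.push_back(i);     // rejected: i stays free
        }
    std::vector<int> wife(n);
    for (int j = 0; j < n; j++) wife[husband[j]] = j;
    return wife;
    }

int main()
    {
    // tiny 3x3 demo
    std::vector<std::vector<int> > pref = { {0,1,2}, {1,0,2}, {0,2,1} };
    std::vector<std::vector<int> > rank = { {0,1,2}, {1,0,2}, {2,1,0} };
    std::vector<int> wife = gale_shapley(pref, rank);
    for (int i = 0; i < (int)wife.size(); i++) printf("man %i -> woman %i\n", i, wife[i]);
    return 0;
    }

The resulting matching is man-optimal: each proposer gets the best partner he can have in any stable matching.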

– joaopfg
  • The other answers here are interesting as well, but I think this is the best approach for my needs. This is essentially a stable marriage problem (I had not heard of this) with a little extra polygamy thrown in for good measure. Gale-Shapley with tweaks will work. Thank you! – fariadantes Aug 20 '22 at 12:27

The Curse of Dimensionality is not your friend. If m is reasonably large, odds are good that all pairs of vectors are at similar distances from each other. And therefore there is no clump of good vectors to find. And therefore your search - even through reasonable candidates - is going to have to look at a lot of options.

That said, I will describe a more efficient algorithm. But note that it also consumes a lot of memory. So it will scale better than your current algorithm at first, but it will also top out at memory limits. Hopefully it is good enough for your needs.

First I'm going to describe a hypothetical directed graph. (We are going to try to avoid actually constructing most of it, but conceptually it exists.) It will be in layers. The l-th layer is the set of all collections of l points, one from each of the first l sets. We connect a node in the l-th layer to one in the (l+1)-th if they agree on the first l points, and the weight of the edge connecting them is the sum of the distances from the last point to all the others. The 0th node is the empty set. And we join the k-th layer to our target with edges of weight 1.

Verify the following: a path from the empty set to the target has as its total weight the sum of pairwise distances within its k-th-level collection (plus the constant 1 from the final edge). Therefore the lowest-weight path gives you the answer you are looking for.

Therefore this is a good candidate for A* search. The heuristic to use for the rest of the path is:

c = sum(points) / len(points)
s = sum(distance(p, c) for p in points)
heuristic = s * (k - len(points))

In other words the remaining points can do no better than to be at the center of mass, and the heuristic is what the sum of the remaining distances would be if this was true.

If there is a natural clump to find, this can find the answer with as little as O(m * n * (n + k) * log(n*k)) work. If there are no natural clumps, there are likely examples that come arbitrarily close to the worst-case bound, which is something like O(n^k * m * k^2 * log(n)). And, of course, it can require O(n^k) memory, at which point you simply crash.

So this might or might not work, but it is worth trying.
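
Here is a minimal C++ sketch of that layered A* (my own illustration of the idea; the unit-weight edges into the target are dropped since they add the same constant to every path, and the path cost g is the running sum of pairwise distances):

#include <cmath>
#include <cstdio>
#include <queue>
#include <vector>

using Vec = std::vector<double>;
using Set = std::vector<Vec>;

double dist(const Vec &a, const Vec &b)        // plain L2 distance
    {
    double dd = 0.0;
    for (size_t i = 0; i < a.size(); i++) { double x = a[i] - b[i]; dd += x * x; }
    return std::sqrt(dd);
    }

struct Node                                     // partial selection (one point per done layer)
    {
    double f, g;                                // f = g + heuristic, g = pairwise sum so far
    std::vector<int> pick;                      // pick[l] = chosen index in sets[l]
    bool operator>(const Node &o) const { return f > o.f; }
    };

double heuristic(const std::vector<Set> &sets, const std::vector<int> &pick)
    {
    size_t l = pick.size(), k = sets.size();
    if (l == 0) return 0.0;
    size_t m = sets[0][0].size();
    Vec c(m, 0.0);                              // center of mass of the chosen points
    for (size_t i = 0; i < l; i++)
        for (size_t j = 0; j < m; j++) c[j] += sets[i][pick[i]][j] / l;
    double s = 0.0;
    for (size_t i = 0; i < l; i++) s += dist(sets[i][pick[i]], c);
    return s * (k - l);                         // rest can do no better than sit at c
    }

std::vector<int> astar(const std::vector<Set> &sets)
    {
    std::priority_queue<Node, std::vector<Node>, std::greater<Node> > open;
    open.push({0.0, 0.0, {}});
    while (!open.empty())
        {
        Node cur = open.top(); open.pop();
        size_t l = cur.pick.size();
        if (l == sets.size()) return cur.pick;  // complete selection reached the target
        for (int n = 0; n < (int)sets[l].size(); n++)
            {
            Node nxt = cur;
            nxt.pick.push_back(n);
            for (size_t i = 0; i < l; i++)      // edge weight: new point to all chosen
                nxt.g += dist(sets[l][n], sets[i][cur.pick[i]]);
            nxt.f = nxt.g + heuristic(sets, nxt.pick);
            open.push(nxt);
            }
        }
    return std::vector<int>();                  // unreachable for non-empty input
    }

int main()
    {
    std::vector<Set> sets =                     // the example from the question
        {
        { {0.1, 0.2}, {0.3, 0.4}, {0.5, 0.6} },
        { {0.5, 0.9}, {0.1, 0.3}, {0.9, 0.1} },
        { {0.2, 0.2}, {0.8, 0.4}, {0.5, 0.1} }
        };
    std::vector<int> pick = astar(sets);
    for (size_t i = 0; i < pick.size(); i++) printf("set %u -> index %i\n", (unsigned)i, pick[i]);
    return 0;
    }

Note the open list stores whole index prefixes, which is where the O(n^k) worst-case memory mentioned above comes from.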

– btilly

So you have k sets, each with ~n vectors, and each vector has m dimensions.

For tasks like this I usually use maps of all grid-aligned "directions" and compute only on them instead of using the directions present in the input data. This usually significantly reduces the number of combinations, and it can also be done recursively with increasing precision (a coarser grid refined near the last found solution). I would attack your case like this:

  1. create map of major vector "directions", O(m*(s^m))

    in case your coordinates are in range <0,1>, create a map of all vector combinations with some s steps; for example 11 steps will in 2D lead to

    (0.0,0.0),(0.0,0.1)...(1.0,0.9),(1.0,1.0)
    

    hold them in an m-dimensional array of vectors for easy access; again a 2D example:

     vec2 dir[s][s] = 
        { 
        { (0.0,0.0),(0.0,0.1)...(0.0,1.0) },
        { (0.1,0.0),(0.1,0.1)...(0.1,1.0) },
        ...
        { (1.0,0.0),(1.0,0.1)...(1.0,1.0) }
        };
    
  2. for each vector of dir compute the max of its min distances to the sets, O(n*k*(s^m))

    so simply loop through all vectors of dir and, for each one, compute the smallest distance to each set and remember only the biggest of those. Hold the result in an m-dimensional array dis, so:

    dis[i1][i2]...[im] = max( |dir[i1][i2]...[im],set[1]| ,
                              |dir[i1][i2]...[im],set[2]| ,
                              ... ,
                              |dir[i1][i2]...[im],set[k]| )
    
  3. find the min distance in dis, O(s^m)

  4. find the closest element in each set to the dir vector corresponding to the min distance in dis, O(n*k*m)

    this is your wanted output. This might be sped up with binary search to O(k*m*log(n)) if n is really big.

Beware: s should be chosen big enough that a valid solution is not missed. The resulting complexity is around O(m*n*k*(s^m)), assuming m is not too big a number. Also, this will be faster only if:

O(m*(n^k)) > O(n*k*(s^m))

The constant factors should be relatively similar for both approaches, leading to something like:

m*(n^k) ~> n*k*(s^m)

assuming "real" case where m<k and s>n for example m=3,k=10,n=50,s=100:

3*(50^10) ~> 50*10*(100^3)
292968750000000000 ~> 500000000
585937500 ~> 1


[Edit1] small C++/VCL example

//---------------------------------------------------------------------------
const int K=3; // sets
const int N=3; // max set size
const int M=2; // dimensions
float set[K][N][M]=
    {
    {{0.1, 0.2}, {0.3, 0.4}, {0.5, 0.6}},
    {{0.5, 0.9}, {0.1, 0.3}, {0.9, 0.1}},
    {{0.2, 0.2}, {0.8, 0.4}, {0.5, 0.1}}
    };
//---------------------------------------------------------------------------
float distance2(float *a,float *b)  // |a-b|^2
    {
    int m;
    float x,dd;
    for (dd=0.0,m=0;m<M;m++)
        {
        x=a[m]-b[m];
        dd+=x*x;
        }
    return dd;
    }
//---------------------------------------------------------------------------
float get_closest(int *ix)  // returns min distance between sets
    {                       // ix[K] will hold selected closest vectors from sets
    const int S=11;         // grid steps
    const float dx=1.0/float(S-1); // grid step
    int s1,s2,k,n,m,s,ss1,ss2;
    float x1,x2,d,d1,d2;
    float dir[S][S][M],dis[S][S];

    // create dir[][]
    for (s1=0,x1=0.0;s1<S;s1++,x1+=dx)
     for (s2=0,x2=0.0;s2<S;s2++,x2+=dx)
        {
        m=0;
        dir[s1][s2][m]=x1; m++;
        dir[s1][s2][m]=x2; m++;
        }

    // compute dis[][] and its minimum dis[ss1][ss2]
    ss1=0; ss2=0;
    for (s1=0;s1<S;s1++)
     for (s2=0;s2<S;s2++)
        {
        // max distance between set[] and dir[s1][s2]
        for (d2=0.0,k=0;k<K;k++)
            {
            // min distance between set[k][] and dir[s1][s2]
            for (d1=0.0,n=0;n<N;n++)
                {
                d=distance2(dir[s1][s2],set[k][n]);
                if ((n==0)||(d1>d)) d1=d;
                }
            if (d2<d1) d2=d1;
            }
        dis[s1][s2]=d2;
        // remember smallest one
        if (dis[ss1][ss2]>dis[s1][s2]){ ss1=s1; ss2=s2; }
        }

    // find corresponding indexes from set[][]
    for (k=0;k<K;k++)
     for (d1=0.0,ix[k]=0,n=0;n<N;n++)
        {
        d=distance2(dir[ss1][ss2],set[k][n]);
        if ((n==0)||(d1>d)){ d1=d; ix[k]=n; }
        }

    // compute real distance
    for (d1=0.0,m=0;m<M;m++)
        {
        for (k=0;k<K;k++)
            {
            d=set[k][ix[k]][m];
            if (k==0){ x1=d; x2=d; }
            if (x1>d) x1=d;
            if (x2<d) x2=d;
            }
        d=x2-x1; d1+=d*d;
        }
    return sqrt(d1);
    }
//---------------------------------------------------------------------------
__fastcall TForm1::TForm1(TComponent* Owner):TForm(Owner)
    {
    AnsiString s;
    int k,n,m,ix[K];
    mm_log->Lines->Add(get_closest(ix));
    for (k=0;k<K;k++)
        {
        for (n=ix[k],s="",m=0;m<M;m++) s+=AnsiString().sprintf("%3.1f ",set[k][n][m]);
        mm_log->Lines->Add(s);
        }
    }
//-------------------------------------------------------------------------

Just ignore the VCL stuff (the last function, which uses get_closest and prints its results); the important part is the function get_closest, which returns the array of indexes of the closest vectors from each set. Here is the result:

0.141421367827692
0.1 0.2 
0.1 0.3 
0.2 0.2 

The code expects all sets to be the same length; for variable lengths you just need to tweak it a small bit (all the loops over n).

In case memory is an issue, note that the arrays dir and dis might be removed if all the iterations are moved into the same loops ...

In case you want to have this ready for any dimensionality, look for nested_for: you can implement the s^m loops with it easily without any weird hacks ...
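
For illustration, a minimal odometer-style sketch of that idea (my own code, under the assumption that nested_for refers to this pattern): a single index array replaces the m hard-coded nested loops, so the s^m grid walk works for any dimensionality m:

#include <cstdio>

int main()
    {
    const int m = 3, s = 4;          // dimensions and grid steps (small demo values)
    int ix[m]; for (int j = 0; j < m; j++) ix[j] = 0;
    for (;;)
        {
        // visit grid point ix[0..m-1] here (e.g. address dir/dis with it)
        for (int j = 0; j < m; j++) printf("%i ", ix[j]);
        printf("\n");
        int j = 0;                   // increment the indexes like an odometer
        for (; j < m; j++) { ix[j]++; if (ix[j] < s) break; ix[j] = 0; }
        if (j == m) break;           // all s^m combinations visited
        }
    return 0;
    }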

– Spektre