
I have n c-dimensional vectors arranged into a matrix A of shape (n, c). How can I sort the rows so that vectors with low Euclidean distances between them end up as close together as possible, and vectors with high distances end up as far apart as possible? For example, I have

A = [[0, 3], [0, 0], [0, 1]],

and the solution can be

A_sorted = [[0, 3], [0, 1], [0, 0]].

Explanation: the original A has a total weighted distance sum of 3×1 + 1×1 + 2×2 = 8 (each pairwise Euclidean distance weighted by the index separation of the pair), while A_sorted has 2×1 + 1×1 + 3×2 = 9. Mathematically, the goal is to maximize this total weighted distance sum. For the 1-dimensional case, this can be achieved by APIs like sort() in NumPy or PyTorch; my main concern is whether there exists a fast implementation for c ≥ 2 with a time complexity of O(n log n). After a long struggle, I failed. Could you please do me a favor?
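To make the objective concrete, here is a minimal sketch of how I compute this sum (assuming each pair is weighted by its index separation |i - j|; `weighted_distance_sum` is just an illustrative helper name, and `math.dist` needs Python 3.8+):

import itertools
import math

def weighted_distance_sum(A):
    # Sum over all pairs (i, j), i < j, of the Euclidean distance
    # between A[i] and A[j], weighted by the index separation |i - j|.
    return sum(
        abs(i - j) * math.dist(A[i], A[j])
        for i, j in itertools.combinations(range(len(A)), 2)
    )

print(weighted_distance_sum([[0, 3], [0, 0], [0, 1]]))  # 8.0
print(weighted_distance_sum([[0, 3], [0, 1], [0, 0]]))  # 9.0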

BinChen
  • The first possibility that comes to mind is to solve the Traveling Salesman Problem. – Stef Dec 08 '21 at 12:06
  • Also, what output do you expect if the vectors form a circle? – Stef Dec 08 '21 at 12:09
  • Very similar question: [sorting points to form a continuous line](https://stackoverflow.com/questions/37742358/sorting-points-to-form-a-continuous-line) – Stef Dec 08 '21 at 12:11
  • Any permutation that satisfies the optimal condition can be what I want. When the points form a circle, I think there can be many solutions, and any of them can be returned to me. Also, I wonder whether there is a quick PyTorch implementation that handles a batch of instances, where each instance consists of N C-dimensional points, i.e. input data of shape (B, C, N)? @Stef – BinChen Dec 24 '21 at 01:35
  • I think it's hard to directly find the optimal result, so I am pursuing a weaker solution that is fast and finds results that may not be optimal but are at least better than a random permutation, or than just computing each point's L2 norm (distance from [0, 0]) and sorting. @Stef – BinChen Dec 24 '21 at 01:39
  • Did you try to apply the answer from the similar question I linked? – Stef Dec 24 '21 at 11:10

1 Answer


I believe this is what you want:

import math

def d(l):
    # Euclidean norm of l, i.e. its distance from the origin [0, 0].
    return math.sqrt(sum([x**2 for x in l]))

A = [[0, 3], [0, 0], [0, 1]]

dists = [d(el) for el in A]
# Pair each vector with its norm and sort by descending norm.
sorted_vecs = sorted(zip(A, dists), key=lambda x: -x[1])
[x[0] for x in sorted_vecs]

Returns

[[0, 3], [0, 1], [0, 0]]

It's O(n log n).
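If A is already a NumPy array, the same descending-norm sort can be written vectorized (a sketch of the identical idea, assuming NumPy is available; argsort is the O(n log n) step):

import numpy as np

A = np.array([[0, 3], [0, 0], [0, 1]])
# Row-wise Euclidean norms, then row indices in descending norm order.
order = np.argsort(-np.linalg.norm(A, axis=1))
A_sorted = A[order]  # [[0, 3], [0, 1], [0, 0]]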

PermanentPon
  • Thanks for your help. But I think this solution may not be correct, since the points are sorted by their L2 distances from the origin [0, 0], and that cannot guarantee optimality. In fact, this problem may be related to the Traveling Salesman Problem and very hard, so I am pursuing a weaker solution that is fast and finds results that may not be optimal but are at least better than a random permutation, or than just computing the L2 norm (distance from [0, 0]) and sorting. @PermanentPon – BinChen Dec 24 '21 at 01:45
  • Note that in general when using `sum`, `max`, `any`, `all`, you should avoid using square brackets. `sum([x**2 for x in l])` and `sum(x**2 for x in l)` both return the same result, but the latter is more efficient. – Stef Dec 24 '21 at 10:58
  • @Stef Are you referring to the fact that a list comprehension creates a list object, while some built-in functions including `sum` accept generators? Then yes, it will be faster as you proposed. – PermanentPon Dec 24 '21 at 13:19
  • Got you, @BinChen. It wasn't very clear from the title and your example, sorry. My proposed solution won't help you much. Have you checked any optimisation-based solutions for the Traveling Salesman Problem, e.g. https://stackoverflow.com/a/44080908/1719231 ? (A simple greedy baseline is sketched after these comments.) – PermanentPon Dec 24 '21 at 13:22
  • @PermanentPon Yes. There's really no reason at all to add square brackets inside `sum`. The list comprehension uses extra memory for no reason, makes the code slightly slower, and doesn't make it more readable (it adds clutter, if anything). In the case of `any` and `all` it would be even worse: the list comprehension forces evaluation of all values rather than short-circuiting as soon as possible. – Stef Dec 24 '21 at 19:11
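For reference, here is a minimal sketch of the greedy nearest-neighbor heuristic suggested in the linked question, adapted to this problem. It is an illustrative baseline only, not the answer's method: as written it is O(n²), not O(n log n), and it carries no optimality guarantee; it merely tends to place mutually close vectors at adjacent indices (`greedy_nn_order` is a hypothetical helper name, and `math.dist` needs Python 3.8+):

import math

def greedy_nn_order(A):
    # Start from the first vector, then repeatedly append the nearest
    # not-yet-used vector; O(n^2) pairwise distance evaluations.
    remaining = list(range(len(A)))
    order = [remaining.pop(0)]
    while remaining:
        last = A[order[-1]]
        nxt = min(remaining, key=lambda i: math.dist(A[i], last))
        remaining.remove(nxt)
        order.append(nxt)
    return [A[i] for i in order]

print(greedy_nn_order([[0, 3], [0, 0], [0, 1]]))
# [[0, 3], [0, 1], [0, 0]]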