I have a matrix in the following format:
matrix = array([[-0.2436986 , -0.25583658, -0.16579486, ..., -0.04291612,
-0.06026303, 0.08564489],
[-0.08684622, -0.21300158, -0.04034272, ..., -0.01995692,
-0.07747065, 0.06965207],
[-0.34814256, -0.20597479, 0.06931241, ..., -0.1236965 ,
-0.1300714 , -0.110122 ],
...,
[-0.04154776, -0.07538085, 0.01860147, ..., -0.01494173,
-0.08960884, -0.21338603],
[-0.34039265, -0.24616522, 0.10838407, ..., 0.22280858,
-0.03465452, 0.04178255],
[-0.30251586, -0.23072125, -0.01975435, ..., 0.34529492,
-0.03508861, 0.00699677]], dtype=float32)
Since I want to calculate the squared distance of each row to every other row, I am using the code below:
import numpy as np

def sq_dist(a, b):
    """
    Returns the squared distance between two vectors.

    Args:
        a (ndarray (n,)): vector with n features
        b (ndarray (n,)): vector with n features
    Returns:
        d (float): squared distance
    """
    d = np.sum(np.square(a - b))
    return d

dim = len(matrix)                # number of rows
dist = np.zeros((dim, dim))
for i in range(dim):
    for j in range(dim):
        dist[i, j] = sq_dist(matrix[i, :], matrix[j, :])
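For reference, the same n x n result can be written without the Python loops using the identity ||a - b||^2 = ||a||^2 + ||b||^2 - 2*a.b. This is only a sketch: a full 100k x 100k float64 result alone is roughly 80 GB, so it only helps for subsets that fit in memory.

import numpy as np

X = matrix.astype(np.float64)            # promote from float32 for accuracy
sq_norms = np.sum(X * X, axis=1)         # ||x_i||^2 for every row, shape (n,)
dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
np.maximum(dist, 0.0, out=dist)          # clamp tiny negatives caused by rounding

scipy.spatial.distance.cdist(matrix, matrix, 'sqeuclidean') computes the same thing, but with the same quadratic memory footprint.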
I am getting the correct result, but even a 5,000-row subset takes about 17 minutes. The full data has 100k rows, so the output is a 100k x 100k matrix, and the cluster job fails after 5 hours.
How can I do this efficiently for such a large matrix? I am using Python 3.8 and PySpark. (A blocked NumPy idea I have been considering is sketched after the expected output below.)
The output matrix should look like this:
dist = array([[0. , 0.57371938, 0.78593194, ..., 0.83454031, 0.58932155,
0.76440328],
[0.57371938, 0. , 0.66285896, ..., 0.89251578, 0.76511419,
0.59261483],
[0.78593194, 0.66285896, 0. , ..., 0.60711896, 0.80852598,
0.73895919],
...,
[0.83454031, 0.89251578, 0.60711896, ..., 0. , 1.01311994,
0.84679914],
[0.58932155, 0.76511419, 0.80852598, ..., 1.01311994, 0. ,
0.5392195 ],
[0.76440328, 0.59261483, 0.73895919, ..., 0.84679914, 0.5392195 ,
0. ]])
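For the full 100k rows, one direction I have been considering (untested at that scale; the block size and the memory-mapped output path below are just placeholders) is to apply the same identity tile by tile, so only a block x block tile is ever held in RAM:

import numpy as np

def pairwise_sq_dists_blocked(X, block=2000, out_path="dist_100k.dat"):
    """Fill a memory-mapped (n, n) matrix of squared distances tile by tile."""
    n = X.shape[0]
    dist = np.memmap(out_path, dtype=np.float32, mode="w+", shape=(n, n))
    sq = np.sum(X.astype(np.float64) ** 2, axis=1)   # ||x_i||^2 for every row
    for i in range(0, n, block):
        Xi = X[i:i + block].astype(np.float64)
        for j in range(0, n, block):
            Xj = X[j:j + block].astype(np.float64)
            tile = sq[i:i + block, None] + sq[None, j:j + block] - 2.0 * (Xi @ Xj.T)
            dist[i:i + block, j:j + block] = np.maximum(tile, 0.0)
    dist.flush()
    return dist

Even then the result is about 40 GB on disk as float32, which is why I am asking whether there is a better way to distribute this with PySpark.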