
I have a very large matrix of shape (10000000, 250). I want to calculate the Euclidean distance between each pair of columns of the matrix.

If I try to use the scipy.spatial.distance functions, they obviously end up building a matrix of size (10000000, 10000000).

To bypass that, I decided to write my own function, but it takes forever to run. Any way to optimize this would be highly appreciated:

import numpy as np
import scipy.spatial as scsp

def calcDist(mat):
    # collect the distance for every pair (i, j) with j < i
    dis = []
    for i in range(mat.shape[0]):
        for j in range(i):
            dis.append(scsp.distance.euclidean(mat[i, :], mat[j, :]))
    return np.array(dis)
    
Q = np.random.random([10000000, 250])
distQ = calcDist(Q)
– deathracer
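
For scale: the question text asks for distances between columns, while the loop above compares rows. Taking the text at face value, there are only 250 columns, so the condensed result has 250·249/2 = 31,125 entries and `pdist` can handle it directly on the transpose. A minimal sketch, assuming column-wise distances really are the goal (with a smaller row count than the real data, just to keep the demo light):

import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical demo size: far fewer rows than the real (10000000, 250) data.
Q = np.random.random([100000, 250])

# pdist treats each *row* as an observation, so transpose to compare columns.
col_dists = pdist(Q.T, metric='euclidean')   # condensed form, 31125 values
col_dist_matrix = squareform(col_dists)      # (250, 250) matrix, if needed

Even at the full 10,000,000 rows this is still only 31,125 distances, so it remains tractable; it is the row-pairwise interpretation that explodes.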
  • It sounds like you want the distances between **rows**, not columns. Typically the rows are the instances of the data, so that would make more sense. Be aware that distances in 250-dimensional space might not be useful (see "the curse of dimensionality"). In the meantime, check [`pdist`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html): it returns only half the matrix, flattened. If it's still too big/slow, maybe look at [`dask`](https://www.dask.org/), maybe using something like [this](https://dask-distance.readthedocs.io/en/latest/dask_distance.html). – Matt Hall Jun 21 '22 at 12:17
  • Hope this helps: https://stackoverflow.com/questions/51350640/parallel-for-loop-over-numpy-matrix – minattosama Jun 21 '22 at 12:17
  • Do you want to calculate each column with all the other columns? – I'mahdi Jun 21 '22 at 13:03
  • There is no way you can compute this in a reasonable time, and the output is stupidly huge. First of all, your input matrix is HUGE: it takes 20 GiB of RAM. The output is *astronomically HUGE*. Indeed, the number of iterations is `10_000_000 * (10_000_000-1) / 2 = 49_999_995_000_000`, and the size of the output Numpy array is 372_528 GiB. This is more RAM than any computer in the world can have (so this only works with memory-mapped arrays, which are very slow, not to mention that a 373 TiB storage device will certainly cost >10_000$)... This does not make any sense. – Jérôme Richard Jun 21 '22 at 17:02
  • The only reasonable option is to *reconsider the need for generating hundreds of terabytes* and compute your data **in-place** (see the sketch after these comments). In fact, I strongly advise you to reconsider the need for such a computation. Can't your input data be reduced to a reasonable size? By the way, building a distance matrix is generally a very bad idea for large datasets (due to the quadratic complexity), and it is generally not really needed by the algorithms that consume it. – Jérôme Richard Jun 21 '22 at 17:06
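
If the row-pairwise distances really are needed, the advice in the comments boils down to never materializing the full result. Below is a sketch of that block-wise idea using `scipy.spatial.distance.cdist`: distances are computed one tile at a time and consumed immediately. The reduction shown here (each row's nearest other row) is a hypothetical stand-in for whatever the downstream task actually needs.

import numpy as np
from scipy.spatial.distance import cdist

def blockwise_min_distance(mat, block=2000):
    # For each row, the distance to its nearest other row, computed in
    # (block x block) tiles so peak memory stays around block*block*8 bytes.
    n = mat.shape[0]
    best = np.full(n, np.inf)
    for i in range(0, n, block):
        rows_i = mat[i:i + block]
        for j in range(0, n, block):
            tile = cdist(rows_i, mat[j:j + block])   # Euclidean by default
            if i == j:
                np.fill_diagonal(tile, np.inf)       # ignore self-distances
            best[i:i + block] = np.minimum(best[i:i + block], tile.min(axis=1))
    return best

# Demo on a much smaller matrix; the full 10,000,000-row case would still be
# very slow, but memory no longer grows with the number of pairs.
Q_small = np.random.random([10000, 250])
nearest = blockwise_min_distance(Q_small)

Even so, the point about scale stands: with 10,000,000 rows there are roughly 5×10^13 pairs, so any per-pair work takes a very long time; reducing the input first is usually the realistic route.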

0 Answers