I have a matrix in the following format:
matrix = array([[-0.2436986 , -0.25583658, -0.16579486, ..., -0.04291612,
-0.06026303, 0.08564489],
[-0.08684622, -0.21300158, -0.04034272, ..., -0.01995692,
-0.07747065, 0.06965207],
[-0.34814256, -0.20597479, 0.06931241, ..., -0.1236965 ,
-0.1300714 , -0.110122 ],
...,
[-0.04154776, -0.07538085, 0.01860147, ..., -0.01494173,
-0.08960884, -0.21338603],
[-0.34039265, -0.24616522, 0.10838407, ..., 0.22280858,
-0.03465452, 0.04178255],
[-0.30251586, -0.23072125, -0.01975435, ..., 0.34529492,
-0.03508861, 0.00699677]], dtype=float32)
Since I want to calculate the squared distance of each row to every other row, I am using the code below:
import numpy as np

def sq_dist(a, b):
    """
    Returns the squared distance between two vectors.

    Args:
        a (ndarray (n,)): vector with n features
        b (ndarray (n,)): vector with n features
    Returns:
        d (float): squared distance
    """
    d = np.sum(np.square(a - b))
    return d

dim = len(matrix)                # number of rows
dist = np.zeros((dim, dim))
for i in range(dim):
    for j in range(dim):
        dist[i, j] = sq_dist(matrix[i, :], matrix[j, :])
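For reference, the same n x n result can be written without the Python loops using the identity ||a - b||^2 = ||a||^2 + ||b||^2 - 2*a.b. This is only a sketch: a full 100k x 100k float64 result alone is roughly 80 GB, so it only helps for subsets that fit in memory.

import numpy as np

X = matrix.astype(np.float64)            # promote from float32 for accuracy
sq_norms = np.sum(X * X, axis=1)         # ||x_i||^2 for every row, shape (n,)
dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
np.maximum(dist, 0.0, out=dist)          # clamp tiny negatives caused by rounding

scipy.spatial.distance.cdist(matrix, matrix, 'sqeuclidean') computes the same thing, but with the same quadratic memory footprint.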
I am getting the correct result, but even a 5,000-row subset takes about 17 minutes. The full data has 100k rows, so the output is a 100k x 100k matrix, and the cluster job fails after 5 hours.
How can I do this efficiently for such a large matrix? I am using Python 3.8 and PySpark. (A blocked NumPy idea I have been considering is sketched after the expected output below.)
The output matrix should look like this:
dist = array([[0. , 0.57371938, 0.78593194, ..., 0.83454031, 0.58932155,
0.76440328],
[0.57371938, 0. , 0.66285896, ..., 0.89251578, 0.76511419,
0.59261483],
[0.78593194, 0.66285896, 0. , ..., 0.60711896, 0.80852598,
0.73895919],
...,
[0.83454031, 0.89251578, 0.60711896, ..., 0. , 1.01311994,
0.84679914],
[0.58932155, 0.76511419, 0.80852598, ..., 1.01311994, 0. ,
0.5392195 ],
[0.76440328, 0.59261483, 0.73895919, ..., 0.84679914, 0.5392195 ,
0. ]])
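For the full 100k rows, one direction I have been considering (untested at that scale; the block size and the memory-mapped output path below are just placeholders) is to apply the same identity tile by tile, so only a block x block tile is ever held in RAM:

import numpy as np

def pairwise_sq_dists_blocked(X, block=2000, out_path="dist_100k.dat"):
    """Fill a memory-mapped (n, n) matrix of squared distances tile by tile."""
    n = X.shape[0]
    dist = np.memmap(out_path, dtype=np.float32, mode="w+", shape=(n, n))
    sq = np.sum(X.astype(np.float64) ** 2, axis=1)   # ||x_i||^2 for every row
    for i in range(0, n, block):
        Xi = X[i:i + block].astype(np.float64)
        for j in range(0, n, block):
            Xj = X[j:j + block].astype(np.float64)
            tile = sq[i:i + block, None] + sq[None, j:j + block] - 2.0 * (Xi @ Xj.T)
            dist[i:i + block, j:j + block] = np.maximum(tile, 0.0)
    dist.flush()
    return dist

Even then the result is about 40 GB on disk as float32, which is why I am asking whether there is a better way to distribute this with PySpark.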