1

I've got this data that looks the following.

                [column 1]   [column 2]   [column 3]   [column 4]   [column 5]
[row 1]        (some value)
[row 2]
[row 3]
...
[row 700 000]

and a 2nd data set that looks exactly the same, but with fewer rows of about 4. What i would like to do is to calculate the euclidean distance between each data in the data-set 1 and 2 and find the minimum value of the 4 as seen here: enter image description here

This is then repeated for the rest of the 700000 rows of data. I know its not advisable to iterate through numpy arrays, hence is there any way to calculate the minimum distance of the 4 different rows from data-set 2 fed into 1 row of data-set 1?

Apologies if this is confusing, but my main points is that I do not wish to iterate through the array and I'm trying to find a better way to table this problem.

In the end, i should obtain back a 700 000 row by 1 column data with the best(lowest) value of the 4 green boxes of the data set 2.

import numpy as np

a = np.array([ [1,1,1,1] , [2,2,2,2] , [3,3,3,3] ])
b = np.array( [ [1,1,1,1] ] )

def euc_distance(array1, array2):
    return np.power(np.sum((array1 - array2)**2, axis = 1) , 0.5)
print(euc_distance(a,b))
# this prints out [0 2 4]

However, when i tried to input more than 1 dimension,

a = np.array([ [1,1,1,1] , [2,2,2,2] , [3,3,3,3] ])
b = np.array( [ [1,1,1,1] , [2,2,2,2] ] )

def euc_distance(array1, array2):
    return np.power(np.sum((array1 - array2)**2, axis = 1) , 0.5)
print(euc_distance(a,b))
# this throws back an error as the dimensions are not the same

I am looking for a way to make it into sort of a 3D array where i get the array of [[euc_dist([1,1,1,1],[1,1,1,1]), euc_dist([1,1,1,1],[2,2,2,2])] , ... ]

Axois
  • 1,961
  • 2
  • 11
  • 22
  • is this what you want https://stackoverflow.com/questions/1401712/how-can-the-euclidean-distance-be-calculated-with-numpy – WhySoSerious Jun 29 '19 at 10:21
  • Hi, thanks for helping. But sadly this is not what im looking for. – Axois Jun 29 '19 at 11:54
  • Could you add some example data, so we can test for ourself? Maybe 20 rows of your 700k array and the 4 rows of your 2nd dataset – Erfan Jun 29 '19 at 12:13
  • Hi, so from my picture, the box 1 is the row, and each row has 5 columns with values 2,3,4,6,7. Same goes for the green box. The problem over here is to be able the access the values of the black box without iterating through it – Axois Jun 29 '19 at 12:45

3 Answers3

1

Couldn't test it, but this should get you there assuming normalised positive data. np.argmax(np.matmul(a, b.T), axis=1)

Little elaboration of my previous post. If performance is still an issue, instead of your approach you can use this:

b = np.tile(b, (a.shape[0], 1, 1))
a = np.tile(a, (1, 1, b.shape[1])).reshape(b.shape)
absolute_dist = np.sqrt(np.sum(np.square(a - b), axis=2))

It produces the exact same result but runs about 20 times faster on 600,000 lines than the generator.

lmielke
  • 135
  • 8
  • Hi, thanks for answering but im looking for the euclidean distance between each green box and black box and then finding the minimum distance of the 4 green boxes instead of doing matrix multiplication. The resulting matrix should be a 700000 row by 1 column. – Axois Jun 29 '19 at 12:08
  • Well, matmul here results in a 700k*4 and argmax makes it a 700k*1. The problem with my approach is, that it's only half the solution, because it doesn't account for vector magnitude. To fix this, you have to divide the matmul result by the Produkt of the euclidean lengths of the corresponding vectors before doing a argMIN. I am not sitting at a computer, so can't write a expression. There is another hands on approach. You can use np.tile to reshape both boxes to 4*700k*5 and then do a element by element subtract. – lmielke Jun 29 '19 at 13:21
  • Hello, I have edited my question with some code to hopefully clarify my question – Axois Jun 29 '19 at 15:39
  • Hello, I also have included an example above. If performance is still an issue, you can try that as well. – lmielke Jul 01 '19 at 11:02
1

You can use broadcasting for this:

a = np.array([
    [1,1,1,1],
    [2,2,2,2],
    [3,3,3,3]
])
b = np.array([
    [1,1,1,1],
    [2,2,2,2]
])

def euc_distance(array1, array2):
    return np.sqrt(np.sum((array1 - array2)**2, axis = -1))

print(euc_distance(a[None, :, :], b[:, None, :]))
# [[0. 2. 4.]
#  [2. 0. 2.]]

Comparing the times for a dataset of your size:

a = np.random.rand(700000, 4)
b = np.random.rand(4, 4)

c = euc_distance(a[None, :, :], b[:, None, :])
d = np.array([euc_distance(a, val) for val in b])
e = np.array([euc_distance(val, b) for val in a]).T

np.allclose(c, d)
# True
np.allclose(d, e)
# True

%timeit euc_distance(a[None, :, :], b[:, None, :])
# 113 ms ± 4.56 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit np.array([euc_distance(a, val) for val in b])
# 115 ms ± 4.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit np.array([euc_distance(val, b) for val in a])
# 7.03 s ± 216 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Nils Werner
  • 34,832
  • 7
  • 76
  • 98
  • Hello, thanks for helping! My end result should be an array of [[0. 2.] [2. 0.] [4. 2.]] as oppose to yours. However i did some modification to the code and yield the same results by inputting euc_distance(a[:, None, :], b[None, :, :]). Perhaps you can explain how does broadcasting works? Thanks! – Axois Jul 01 '19 at 11:23
  • ... or you can simply transpose the output. [Have you read the NumPy docs on broadcasting](https://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)? – Nils Werner Jul 01 '19 at 12:19
  • I've just read it but I cant seem to link it as to how your code works. Btw when iterating through larger loops, ive encountered a memory error, could it be one of the limitations when doing such slicing? – Axois Jul 01 '19 at 13:09
  • Yes, broadcasting will calculate all values first, before doing the sum. In this instance, `c` will temporarily consume 4x as much memory as `d`. – Nils Werner Jul 01 '19 at 15:07
  • Hi, correct me if I am wrong, I've did some learning on my own and if I'm not wrong what you are trying to do is to convert the array first into 3D so that you can broadcast the shapes together? What does setting axis= -1 do btw? Does it reduce the dimension back from 3D to 2D? – Axois Jul 02 '19 at 04:19
  • **edit** Okay, i realised that setting axis = -1 did the same as axis = 2, which is to sum up all of the values nested in the list. – Axois Jul 02 '19 at 06:00
0

Thanks for everyone's help, however i think I've managed to solve my own problem by using a simple list comprehension. I was over-complicating things! By doing so, instead of iterating each data, i have essentially cut-down more than half of the time which increases exponentially.

What i did was the following c = np.array( [euc_distance(val, b) for val in a]) who knew this problem could have such a simple solution!

Axois
  • 1,961
  • 2
  • 11
  • 22