I've got this data that looks the following.
[column 1] [column 2] [column 3] [column 4] [column 5]
[row 1] (some value)
[row 2]
[row 3]
...
[row 700 000]
and a 2nd data set that looks exactly the same, but with fewer rows of about 4.
What i would like to do is to calculate the euclidean distance between each data in the data-set 1 and 2 and find the minimum value of the 4 as seen here:
This is then repeated for the rest of the 700000 rows
of data. I know its not advisable to iterate through numpy
arrays, hence is there any way to calculate the minimum distance of the 4 different rows from data-set 2 fed into 1 row of data-set 1?
Apologies if this is confusing, but my main points is that I do not wish to iterate through the array and I'm trying to find a better way to table this problem.
In the end, i should obtain back a 700 000 row by 1 column data with the best(lowest) value of the 4 green boxes of the data set 2.
import numpy as np
a = np.array([ [1,1,1,1] , [2,2,2,2] , [3,3,3,3] ])
b = np.array( [ [1,1,1,1] ] )
def euc_distance(array1, array2):
return np.power(np.sum((array1 - array2)**2, axis = 1) , 0.5)
print(euc_distance(a,b))
# this prints out [0 2 4]
However, when i tried to input more than 1 dimension,
a = np.array([ [1,1,1,1] , [2,2,2,2] , [3,3,3,3] ])
b = np.array( [ [1,1,1,1] , [2,2,2,2] ] )
def euc_distance(array1, array2):
return np.power(np.sum((array1 - array2)**2, axis = 1) , 0.5)
print(euc_distance(a,b))
# this throws back an error as the dimensions are not the same
I am looking for a way to make it into sort of a 3D array where i get the array of [[euc_dist([1,1,1,1],[1,1,1,1]), euc_dist([1,1,1,1],[2,2,2,2])] , ... ]