I have test and train sets with the following dimensions; all features (i.e. columns) are integers.
X_train.shape
(990188L, 19L)
X_test.shape
(424367L, 19L)
I want to compute the Euclidean distance between every row of the test set and every row of the train set, and then remove from the train set any row that lies within a distance threshold of 0.005 of some test row.
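For scale, materializing the full distance matrix is out of the question on these shapes; a quick back-of-the-envelope check (assuming float64 distances):

n_train, n_test = 990188, 424367
full_gb = n_train * n_test * 8 / 1e9    # ~3400 GB (about 3.4 TB) for the full matrix
batch_gb = 1000 * n_test * 8 / 1e9      # ~3.4 GB for a single 1000-row batch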
I have the following serial code, which works but is far too slow.
from scipy.spatial import distance

train = []   # indices of train rows that fall within the threshold
for a in range(X_test.shape[0]):
    a_test = X_test[a]
    for b in range(X_train.shape[0]):
        a_train = X_train[b]
        dst = distance.euclidean(a_test, a_train)
        if dst <= 0.005:
            train.append(b)
where I note down the indices of the train-set rows that lie within the distance threshold.
Is there any way to parallelize this code?
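Even without parallelism, the inner loop above can be replaced with NumPy broadcasting; a minimal sketch of the same computation (not benchmarked on the full data):

import numpy as np

train = []
for a in range(X_test.shape[0]):
    # distances from one test row to every train row in a single vectorized call
    dst = np.linalg.norm(X_train - X_test[a], axis=1)
    train.extend(np.where(dst <= 0.005)[0])

That still loops over the 424k test rows in Python, though, so it only removes one of the two loops.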
I tried using euclidean_distances from sklearn.metrics.pairwise, but since the data set is huge, I get a MemoryError. I then tried to apply euclidean_distances in batches, but I don't think the following code is working correctly. Please help me if there is any way to parallelize it.
from sklearn.metrics.pairwise import euclidean_distances
import numpy as np

batch = 1000
train = []                                   # global indices of train rows within the threshold
for i in range(0, X_train.shape[0], batch):  # slicing clamps the final partial batch
    # distances between one batch of train rows and the whole test set
    dst_mat = euclidean_distances(X_train[i:i + batch, :], X_test)
    # a train row qualifies if it is within the threshold of ANY test row
    condition = np.any(dst_mat <= 0.005, axis=1)
    # shift batch-local indices to global train indices before collecting
    train.extend(np.where(condition)[0] + i)
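Roughly, the kind of parallelization I imagine is the following joblib-based sketch (untested; batch_hits is my own helper name, and each worker holds one 1000 x n_test distance block, so memory grows with n_jobs):

from joblib import Parallel, delayed
from sklearn.metrics.pairwise import euclidean_distances
import numpy as np

def batch_hits(i, batch=1000):
    # global indices of train rows in this batch close to any test row
    dst_mat = euclidean_distances(X_train[i:i + batch, :], X_test)
    return np.where(np.any(dst_mat <= 0.005, axis=1))[0] + i

# run the batches on several cores; each worker needs ~3.4 GB for its block
parts = Parallel(n_jobs=4)(
    delayed(batch_hits)(i) for i in range(0, X_train.shape[0], 1000))
train = np.concatenate(parts)

Alternatively, since this is really a radius query, sklearn.neighbors.NearestNeighbors(radius=0.005, n_jobs=-1) fitted on the test set, with radius_neighbors called on the train set, would avoid materializing distance blocks at all, but I don't know how it behaves with 19 features at this scale.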