
I have train and test sets with the following dimensions; all features (i.e. columns) are integers.
X_train.shape
(990188L, 19L)

X_test.shape
(424367L, 19L)

I want to find the Euclidean distance between every row of the train set and every row of the test set, and then remove from the train set each row that lies within a distance threshold of 0.005 of any test row. I have the following sequential code, which works but is far too slow.

from scipy.spatial import distance

train = []  # indices of train rows within the threshold of some test row
for a in range(X_test.shape[0]):
    a_test = X_test[a]
    for b in range(X_train.shape[0]):
        a_train = X_train[b]
        # distance between one test row and one train row
        dst = distance.euclidean(a_test, a_train)
        if dst <= 0.005:
            train.append(b)

Here I note down the indices of the train rows that lie within the distance threshold. Is there any way to parallelize this code? I tried `from sklearn.metrics.pairwise import euclidean_distances`, but since the data set is huge I get a memory error.

I tried to work around this by calling `euclidean_distances` in batches, but somehow the following code does not seem to work correctly. Please help me if there is any way to parallelize this code.

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

batch = 1000
train = []  # indices of train rows within the threshold of some test row
for i in range(0, X_train.shape[0], batch):
    # distances between one batch of train rows and all test rows
    dst_mat = euclidean_distances(X_train[i:i + batch, :], X_test)
    condition = np.any(dst_mat <= 0.005, axis=1)
    index = np.where(condition)[0] + i  # shift to indices in the full train set
    train.extend(index)

1 Answer


Use `scipy.spatial.distance.cdist`. It computes the pairwise distances between the rows of two different arrays.

Thanks to Warren Weckesser for pointing out this solution.
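
A minimal sketch of how `cdist` could be combined with the batching idea from the question, so that the full 990188 × 424367 distance matrix is never held in memory at once. The helper name `train_rows_near_test` and the batch size of 1000 are illustrative choices, not something fixed by the answer:

import numpy as np
from scipy.spatial.distance import cdist

def train_rows_near_test(X_train, X_test, threshold=0.005, batch=1000):
    # Return indices of train rows within `threshold` of any test row.
    close = []
    for i in range(0, X_train.shape[0], batch):
        # distances between one batch of train rows and all test rows
        d = cdist(X_train[i:i + batch], X_test)
        hits = np.where(np.any(d <= threshold, axis=1))[0]
        close.extend(hits + i)  # shift batch-local indices to global ones
    return np.array(close)

to_drop = train_rows_near_test(X_train, X_test)
X_train_filtered = np.delete(X_train, to_drop, axis=0)

Each batch only needs roughly batch × 424367 × 8 bytes for its distance matrix, so the batch size can be tuned to the available RAM.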

  • maybe, because it will only calculate half of the values. – Joe Jan 31 '18 at 06:48
  • Let me try it and come back to you. Thanks for help. – Niranjan Agnihotri Jan 31 '18 at 06:49
  • Have you made a simple estimation of the memory you need? – Joe Jan 31 '18 at 06:50
  • Ok yes. Taking the euclidean_distances() function into consideration, I am able to run the above batch code successfully. For a batch of 1000 train examples and all test examples it takes around 14 GB. I am able to run it on a 32 GB RAM machine. (A rough check of this figure is sketched after these comments.) – Niranjan Agnihotri Jan 31 '18 at 06:53
  • To compute distances between the points in two different arrays, use [`scipy.spatial.distance.cdist`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cdist.html) (not `pdist`). – Warren Weckesser Jan 31 '18 at 08:45
  • `cdist` does not return any redundant data, so there is no need for it to return a condensed form. – Warren Weckesser Jan 31 '18 at 09:52
  • I think the `cdist` function serves the same purpose as `euclidean_distances`, and both give a memory error for the above data set's dimensions. I wanted help with the above code snippet, which tries to carry out the operation in batches so that we don't get the memory error. Thanks for reaching out, by the way :) – Niranjan Agnihotri Feb 02 '18 at 08:54
  • I found other questions with a similar topic. Maybe they help you https://stackoverflow.com/questions/3674409/how-to-split-partition-a-dataset-into-training-and-test-datasets-for-e-g-cros https://stackoverflow.com/questions/39622639/how-to-break-numpy-array-into-smaller-chunks-batches-then-iterate-through-them https://stackoverflow.com/questions/48535829/how-do-i-find-the-euclidean-distances-between-rows-of-my-test-and-train-set-effi – Joe Feb 02 '18 at 17:43
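
A rough back-of-the-envelope check of the 14 GB figure from the comments above, assuming float64 distances:

# memory for one 1000-row batch of distances against all 424367 test rows
batch_rows, test_rows = 1000, 424367
dst_mat_gb = batch_rows * test_rows * 8 / 1e9
print(dst_mat_gb)  # ~3.4 GB for the output matrix alone

The output matrix itself accounts for only about 3.4 GB; the rest of the observed ~14 GB peak would come from temporaries allocated internally by euclidean_distances.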