
I have written code using numpy that takes an array of size (m x n), where the rows (m) are individual observations made up of (n) features, and creates a square distance matrix of size (m x m). This distance matrix holds the distance of a given observation from all other observations; e.g. row 0, column 9 is the distance between observation 0 and observation 9.

import numpy as np
#import cupy as np

def l1_distance(arr):
    # For a 1-D array, ord=1 gives the L1 (Manhattan) norm: sum(abs(arr))
    return np.linalg.norm(arr, 1)

X = np.random.randint(low=0, high=255, size=(700,4096))

distance = np.empty((700,700))

for i in range(700):
    for j in range(700):
        distance[i,j] = l1_distance(X[i,:] - X[j,:])

I attempted this on the GPU using cupy by uncommenting the second import statement, but the double for loop is obviously drastically inefficient: numpy takes approx. 6 seconds, while cupy takes 26 seconds. I understand why, but it's not immediately clear to me how to parallelize this process.

I know I'm going to need to write a reduction kernel of some sort, but I can't think of how to construct one cupy array from iterative operations on elements of another array.
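
For what it's worth, a pairwise L1 distance can be expressed as a cupy.ReductionKernel applied to broadcast views of X. The sketch below is one possible formulation, not code from the original post; the kernel name l1_dist and the float32 cast are my own choices.

import cupy as cp

# Sketch of a ReductionKernel for pairwise L1 distance (hypothetical name).
l1_dist = cp.ReductionKernel(
    'T x, T y',     # input params: one element from each observation
    'T z',          # output param: the reduced distance
    'fabs(x - y)',  # map expression, applied element-wise
    'a + b',        # reduce expression: a running sum
    'z = a',        # post-reduction map
    '0',            # identity value for the sum
    'l1_dist'       # kernel name
)

X = cp.random.randint(low=0, high=255, size=(700, 4096)).astype(cp.float32)
# Broadcast to an (m, m, n) view of all pairs and reduce over the feature axis.
distance = l1_dist(X[:, None, :], X[None, :, :], axis=2)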

  • Why aren't you using scipy.spatial.distance.cdist? That would have to be 5 to 10 times faster than your CPU loops (see the cdist snippet just after these comments) – talonmies Oct 22 '20 at 07:07
  • I explicitly coded out what I was trying to do for reader comprehension. In real life I wouldn't do double loops for the CPU implementation. However, as far as I know, there is no CUDA kernel (neither ElementwiseKernel nor ReductionKernel) that does what scipy does for a GPU implementation. This is just a toy dataset; my real dataset is much larger, hence why I'm trying to parallelize it and port it to GPU – Moose Sims Oct 22 '20 at 07:30
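
For reference, the cdist suggestion from the first comment is a CPU one-liner; 'cityblock' is SciPy's name for the L1 (Manhattan) metric:

import numpy as np
from scipy.spatial.distance import cdist

X = np.random.randint(low=0, high=255, size=(700, 4096))
distance = cdist(X, X, metric='cityblock')  # (700, 700) pairwise L1 distances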

1 Answer


Using broadcasting, CuPy takes 0.10 seconds on an A100 GPU, compared to 6.6 seconds for NumPy:

    for i in range(700):
        distance[i,:] = np.abs(np.broadcast_to(X[i,:], X.shape) - X).sum(axis=1)

This vectorizes the inner loop and computes the distances from one observation to all the others in parallel.
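
Put together, a complete version of this approach might look like the sketch below; the array sizes are from the question, while the float32 cast and the synchronize call are my additions (CuPy launches kernels asynchronously, so you must synchronize before reading timings):

import cupy as cp

m, n = 700, 4096
X = cp.random.randint(low=0, high=255, size=(m, n)).astype(cp.float32)

distance = cp.empty((m, m), dtype=cp.float32)
for i in range(m):
    # Row i is broadcast against every row of X; the sum reduces over features.
    distance[i, :] = cp.abs(cp.broadcast_to(X[i, :], X.shape) - X).sum(axis=1)

# Block until the queued kernels finish before reading results or timings.
cp.cuda.Stream.null.synchronize()

Keeping the loop over rows means each iteration only needs an (m, n)-sized temporary, instead of materializing all m*m*n differences at once.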

  • I'm about to go to bed; I'll try this in the morning, but can you explain physically what broadcast_to() is doing? I like to understand how the data is moving between CPU and GPU – Moose Sims Oct 22 '20 at 07:37
  • broadcast_to creates an array with the shape of X but with all its entries being the X[i,:] vector. A broadcasted array is a view that uses the strides to always access the same row from the original array, so no extra memory is required. Data is not being moved from CPU to GPU at all; it lives on the GPU the whole time (a small stride example follows these comments). – emcastillo Oct 22 '20 at 07:42
  • I'm working the code out in my head and this implementation seems brilliant. Something so simple, and yet it eluded me so hard, haha. Do you have any good references for GPU programming? – Moose Sims Oct 22 '20 at 07:48
  • When using cupy, you should try to vectorize all the operations you can and avoid single-item assignments. Broadcasting and fiddling with strides are pretty important, but this is pretty similar to NumPy. You can read https://ipython-books.github.io/46-using-stride-tricks-with-numpy/ If you want to write your own custom kernels, the CuPy docs have some examples: https://docs.cupy.dev/en/stable/tutorial/kernel.html – emcastillo Oct 22 '20 at 07:53
  • I wonder how this compares to scikit-learn `pairwise_distances`, scipy `cdist`, etc. – Sterling Aug 17 '21 at 05:44
  • `start = time(); cdist(X, X); end = time(); print("Elapsed = %s" % (end - start))` gives 1.084 s for me. – Sterling Aug 21 '21 at 06:54
  • `from sklearn.metrics import pairwise_distances; start = time(); pairwise_distances(X, X, n_jobs=-1); end = time(); print("Elapsed = %s" % (end - start))` gives me ~0.12 s, while the CuPy broadcasting is still the fastest at ~0.045 s. – Sterling Aug 21 '21 at 07:03
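
To make the stride point from the comments concrete, here is a small NumPy sketch (the mechanics are the same in CuPy): the broadcast view has stride 0 along the repeated axis, so every "row" of the view aliases the same memory.

import numpy as np

row = np.arange(5, dtype=np.int64)
view = np.broadcast_to(row, (3, 5))

print(view.strides)                 # (0, 8): stride 0 repeats the same row
print(np.shares_memory(view, row))  # True: a view, not a copy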