Numpy: Single loop vectorized code slow compared to two loop iteration

Question

The following codes iterates over each element of two array to compute pairwise euclidean distance.

def compute_distances_two_loops(X, Y):
    num_test = X.shape[0]
    num_train = Y.shape[0]
    dists = np.zeros((num_test, num_train))
    for i in range(num_test):
        for j in range(num_train):
            dists[i][j] = np.sqrt(np.sum((X[i] - Y[j])**2))
    return dists

The following code serves the same purpose but with single loop.

def compute_distances_one_loop(X, Y):
    num_test = X.shape[0]
    num_train = Y.shape[0]
    dists = np.zeros((num_test, num_train))
    for i in range(num_test):
        dists[i, :] = np.sqrt(np.sum((Y - X[i])**2, axis=1))
    return dists

Below are time comparison for both.

two_loop_time = time_function(compute_distances_two_loops, X, Y)
print ('Two loop version took %f seconds' % two_loop_time)

>> Two loop version took 20.3 seconds

one_loop_time = time_function(compute_distances_one_loop, X, Y)
print ('One loop version took %f seconds' % one_loop_time)

>> One loop version took 80.9 seconds

Both X and Y are numpy.ndarray with shape -

X: (500, 3000) Y: (5000, 3000)

Out of intuition the results are not correct, the single loop should run at least with same speed. What am I missing here ?

PS: The result is not from a single run. I ran the code number of times, on different machines, the results are similar.

I may be missing something here but where does the `self` come from in your `one_loop` function? for that matter, `self.X_train`. — Gene Burinsky, Sep 15 '16 at 03:47
Just to make sure, is the mismatch is `shape` between `X` and `Y` deliberate or a typo? — Gene Burinsky, Sep 15 '16 at 04:02
When the 2nd dim of the arrays is 300, `one_loop` is 2x faster; when 3000, it's the other way, it's half the speed. That's with 50 and 500 for the 1st. My guess it's running into memory management issues. — hpaulj, Sep 15 '16 at 04:24
did you notice that on smaller matrices, say `Y = np.random.randn(1000,300)` and `X = np.random.randn(100,300)`, the one loop is faster on my machine. — Gene Burinsky, Sep 15 '16 at 04:24
You might meaure the time spent on `np.sqrt(np.sum((X[i] - Y[j])**2))` and `np.sqrt(np.sum((Y - X[i])**2, axis=1))` — percusse, Sep 15 '16 at 07:20

score 2 · Accepted Answer · edited Nov 19 '18 at 16:35

The reason is size of arrays within the loop body. In the two loop variant works on two arrays of 3000 elements. This easily fits into at least the level 2 cache of a cpu which is much faster than the main memory but it is also large enough that computing the distance is much slower than the python loop iteration overhead.

The second case the loop body works on 5000 * 3000 elements. This is so much that the data needs go to main memory in each computation step (first the Y-X[i] subtraction into a temporary array, squaring the temporary into another temporary and then read it back to sum it). The main memory is much slower than the cpu for the simple operations involved so it takes much longer despite removing a loop. You could speed it up a bit by using inplace operations writing into preallocated temporary array, but it will still be slower than the two loop variant for these array sizes.

Note that scipy.spatial.distance.cdist(X, Y) is probably going to be the fastest, as it does not need any temporaries at all

Didn't came around for a second that cache could be playing it's role . Thanks ! @jtaylor — Shubham Chandel, Sep 17 '16 at 00:54
The answer is a bit misleading I think. The two-loop version doesn't work on two arrays - it works at each iteration on two values, so very little data is being copied between Numpy C-code and the Python interpreter code. However, in the one-loop version, an array of the same size as `Y` is being constructed and returned. This is what's taking so much time - I agree - but the way you read the number of elements is a bit off if I'm not mistaken. — Yuval, Feb 11 '22 at 23:16

Numpy: Single loop vectorized code slow compared to two loop iteration

1 Answers1

Linked