
I have a list of descriptors (arrays) and need to compute the cosine distance between them, which is a lot of computation. For a list with n vectors it performs n * (n - 1) / 2 operations. I need to do this for a large n, so how can I speed up this process using multiprocessing? I found some answers, but it is not clear to me how to apply them to my case.

The code I want to speed up is below:

from scipy.spatial.distance import cosine as cosine_distance

n = len(all_vectors)
all_distances = []
for i in range(n):
    for j in range(i + 1, n):
        x1 = all_vectors[i]
        x2 = all_vectors[j]
        all_distances.append(cosine_distance(x1, x2))

[UPDATE]

I also need to do a small label check, so the original code is actually the following:

from scipy.spatial.distance import cosine as cosine_distance

n = len(all_vectors)
all_distances = []
all_label_checks = []
for i in range(n):
    for j in range(i + 1, n):
        x1 = all_vectors[i]
        x2 = all_vectors[j]
        all_distances.append(cosine_distance(x1, x2))
        
        label_x1 = all_labels[i]  # all_labels and all_distances are tied
        label_x2 = all_labels[j]
        all_label_checks.append(int(label_x1 == label_x2))

I tested the timing of some of the suggestions, and the best answer so far is from @DaniMesejo. Here is the code that I used to test each case:

import time
import numpy as np

from scipy.spatial.distance import pdist, cosine as cosine_distance, squareform
from itertools import combinations, starmap


EMBEDDINGS_SIZE = 128
N_EMBEDDINGS = 1024


def get_n_embeddings(n):
    return np.random.randn(n, EMBEDDINGS_SIZE)


embeddings = get_n_embeddings(N_EMBEDDINGS)

###################
# scipy pdist
###################
start = time.time()
pdist_distances = pdist(embeddings, 'cosine')
end = time.time()
print(f"Using scipy pdsit: {end - start}")

###################
# nested loop
###################
nested_distances = []
start = time.time()
for i in range(N_EMBEDDINGS):
    for j in range(i + 1, N_EMBEDDINGS):
        x1 = embeddings[i]
        x2 = embeddings[j]
        nested_distances.append(cosine_distance(x1, x2))
end = time.time()
print(f"Using nested loop: {end - start}")

###################
# combinations
###################
start = time.time()
combination_distances = []
for x1, x2 in combinations(embeddings, 2):
    combination_distances.append(cosine_distance(x1, x2))
end = time.time()
print(f"Using combination: {end - start}")

###################
# starmap
###################
start = time.time()
starmap_distances = list(starmap(cosine_distance, combinations(embeddings, 2)))
end = time.time()
print(f"Using starmap: {end - start}")

# pdist_class_distances is computed in the "pdist and class" section below
print(np.allclose(pdist_distances, nested_distances))
print(np.allclose(pdist_distances, pdist_class_distances))
print(np.allclose(pdist_distances, combination_distances))
print(np.allclose(pdist_distances, starmap_distances))

And the results:

Using scipy pdist: 0.0303647518157959
Using pdist and class: 0.09841656684875488
Using nested loop: 13.415924549102783
Using combination: 13.093504428863525
Using starmap: 13.177483081817627
True
True
True
True

To solve my label problem I can also use pdist. I just need to transform my label list (which tells the label of each embedding) into a list of 2D points and compute the pairwise distance. Pairs with the same label will have distance 0:

###################
# pdist and class
###################
# assign the same label to every group of 4 consecutive embeddings
classes = []
count = 0
id_ = 0
for i in range(N_EMBEDDINGS):
    classes.append(id_)
    count += 1
    if count % 4 == 0:
        count = 0
        id_ += 1
labels_x = [(i, i) for i in classes]  # embed each label as a 2D point so pdist can compare pairs
start = time.time()
pdist_class_distances = pdist(embeddings, 'cosine')
pdist_labels = pdist(labels_x, 'euclidean')
# a pair distance of 0 means the two embeddings share a label
pdist_labels = [1 if x == 0.0 else 0 for x in pdist_labels]
end = time.time()
print(f"Using pdist and class: {end - start}")
  • Not saying multiprocessing is necessarily the ideal solution, but typically the way you'd use multiprocessing is to have each process only work through a chunk of `all_vectors`, and then you'd concatenate the results together into one list at the end. Does that make sense? (see the sketch after this comment thread) – Random Davis Nov 20 '20 at 20:20
  • An initial thing to look at would be to put the `for j in range(i + 1, n)` into a thread and collect the results as they complete. Take a look at some things like https://stackoverflow.com/questions/6893968/how-to-get-the-return-value-from-a-thread-in-python to do an initial janky implementation and dip your toes in threading. Start by turning the content of the `j` for loop into a function that accepts i and n and returns x1/x2, or does cosine_distance and returns the result of that, since it's likely the heavy lifting going on – AsciiFace Nov 20 '20 at 20:22
  • Did you check [cdist](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cdist.html) using the cosine distance metric? Its capable of computing the distance between each pair of the two collections of inputs – DarrylG Nov 20 '20 at 20:25
  • @AsciiFace: Threading doesn't get you anything for CPU bound work in CPython unless the work is being done in a C extension that explicitly releases the GIL while doing a *lot* of work. You'd have to go with `multiprocessing` or `concurrent.futures.ProcessPoolExecutor` to do the work in separate processes (which aren't GIL bound, but aren't always faster due to IPC overhead if the work done is too small). – ShadowRanger Nov 20 '20 at 20:27
  • @ShadowRanger I've only done threading in go recently so I forgot how busted python's threading model is – AsciiFace Nov 20 '20 at 20:41
  • @AsciiFace: Technically, it's not *Python's* threading model. The language spec doesn't require it. But the two most up-to-date implementations of Python, the CPython reference interpreter and the PyPy optimizing JIT-ing interpreter, are both GIL-bound (Stackless Python doesn't appear to use actual threads at all; it's equivalent to GIL bound). The only two without a GIL to my knowledge are IronPython and Jython, neither of which has managed to support the Python 3 spec (they're still only 2.7 compatible, over 10 years after Py3 released, and after CPython dropped support for 2.7 entirely). – ShadowRanger Nov 20 '20 at 20:49
  • @AsciiFace: So you're technically wrong, but right in all ways that matter. :-) – ShadowRanger Nov 20 '20 at 20:49
  • @ShadowRanger yeah no I was entirely thinking in the land of actually utilizing multiple CPUs, way off base lol - I never reach for python when I need concurrency and was out of my depth – AsciiFace Nov 20 '20 at 21:35
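A minimal sketch of the chunk-per-process idea from the comments, using concurrent.futures.ProcessPoolExecutor (the data here is a hypothetical stand-in, seeded so that every worker process rebuilds the identical array under the spawn start method):

import numpy as np
from concurrent.futures import ProcessPoolExecutor
from scipy.spatial.distance import cosine as cosine_distance

rng = np.random.default_rng(0)  # fixed seed: workers re-create the same data
all_vectors = rng.standard_normal((256, 128))

def row_distances(i):
    # distances from vector i to every later vector: one "chunk" of the triangle
    return [cosine_distance(all_vectors[i], all_vectors[j])
            for j in range(i + 1, len(all_vectors))]

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        chunks = pool.map(row_distances, range(len(all_vectors) - 1))
        all_distances = [d for chunk in chunks for d in chunk]

Each row i compares against n - 1 - i later vectors, so early tasks are larger; submitting rows individually (as here) lets the pool balance that load as workers free up.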

2 Answers


A possible approach is to use the scipy function pdist, for example:

from scipy.spatial.distance import pdist, cosine as cosine_distance, squareform
import numpy as np


def all_pairs_using_pdist(X):
    """Function using pdist"""
    distances = squareform(pdist(X, 'cosine'))
    rows, _ = distances.shape
    return list(distances[np.triu_indices(rows, 1)])


def all_pairs_for_loops(all_vectors):
    """Function using a nested for loop"""
    n = len(all_vectors)
    all_distances = []
    for i in range(n):
        for j in range(i + 1, n):
            x1 = all_vectors[i]
            x2 = all_vectors[j]
            all_distances.append(cosine_distance(x1, x2))
    return all_distances
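Side note (an observation, not part of the answer's benchmark): pdist already returns the condensed distance vector, i.e. the upper triangle in the same (i, j) order the nested loop produces, so the squareform round trip is only needed when the full square matrix is wanted:

def all_pairs_condensed(X):
    # same result as all_pairs_using_pdist, without building the square matrix
    return list(pdist(X, 'cosine'))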

For a matrix like the following:

Y = np.random.randn(100, 25)

It gives the timings:

%timeit all_pairs_using_pdist(Y)
630 µs ± 28.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit all_pairs_for_loops(Y)
216 ms ± 78 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

About 300 times faster. And of course, both approaches yield the same result:

Y = np.random.randn(100, 25)
print(np.allclose(all_pairs_for_loops(Y), all_pairs_using_pdist(Y)))

Output

True
Dani Mesejo
  • Your answer seems pretty good, thank you very much. I only have one extra question. How can I adapt your solution to do a little verification? My original problem contains `all_vectors` and `all_labels`. For each comparison between two vectors `x1` and `x2`, I also check if they share the same label. I append the cosine distance to one list, and `1` or `0` to another list depending on whether they are from the same class or not. – Matheus Santos Nov 23 '20 at 14:43
  • For example: ```python for i in range(n): for j in range(i + 1, n): x1 = all_vectors[i] x2 = all_vectors[j] all_distances.append(cosine_distance(x1, x2)) label_x1 = all_labels[i] label_x2 = all_labels[j] all_labels_check.append(int(label_x1==label_x2)) ``` – Matheus Santos Nov 23 '20 at 14:45

How much this will help I can't say, but your nested loop is basically an attempt to make all unique 2-combinations of legal indices in order to get all unique 2-combinations of the values. But there are built-ins to do this, so with `from itertools import combinations`, this:

for i in range(n):
    for j in range(i + 1, n):
        x1 = all_vectors[i]
        x2 = all_vectors[j]

could simplify to this:

for x1, x2 in combinations(all_vectors, 2):

You could even layer all this over itertools.starmap to push the entire loop to the C layer (assuming cosine_distance is itself implemented in C; if not, it just limits the upside, but still works fine), simplifying all of your posted code to:

from scipy.spatial.distance import cosine as cosine_distance
from itertools import combinations, starmap

all_distances = list(starmap(cosine_distance, combinations(all_vectors, 2)))

and running faster to boot (meaningfully faster if cosine_distance is a relatively low cost operation implemented in C).

If that's not fast enough, odds are you'll have to use multiprocessing/concurrent.futures.ProcessPoolExecutor to parallelize (maybe multiprocessing.dummy/concurrent.futures.ThreadPoolExecutor if the scipy function is implemented in C and releases the GIL while running, allowing true thread parallelism, which is normally not available to CPU bound threads on CPython due to the GIL).
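A rough sketch of that multiprocessing route, assuming cosine_distance dominates the runtime (the input data and the chunksize value are placeholders that would need tuning):

import numpy as np
from itertools import combinations
from multiprocessing import Pool
from scipy.spatial.distance import cosine as cosine_distance

rng = np.random.default_rng(0)
all_vectors = rng.standard_normal((256, 128))  # stand-in for the real data

if __name__ == "__main__":
    with Pool() as pool:
        # chunksize batches many small pairs per task to amortize IPC overhead
        all_distances = pool.starmap(cosine_distance,
                                     combinations(all_vectors, 2),
                                     chunksize=1024)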

ShadowRanger
  • I already tested with `itertools.combinations` and it doesn't speed things up. I tried your suggestion of using `starmap`, but it also doesn't speed things up. I tested without calculating the cosine distance, and it seems that the bottleneck is in generating all the possible combinations, whether with built-in functions or not. – Matheus Santos Nov 24 '20 at 19:50
  • @MatheusSantos: How large is `all_vectors`? If it's large enough, it's going to take time; not much to improve (aside from maybe parallelism). – ShadowRanger Nov 24 '20 at 20:47
  • As big as possible. The solution of @DaniMesejo works, I only need to figure out how to include the label check that I mentioned in the update – Matheus Santos Nov 24 '20 at 23:23