
I have a matrix A and want to calculate the distance matrix D from it iteratively. The reason for wanting to calculate it step by step is to later include some if-statements in the iteration process.

My code right now looks like this:

import numpy as np
from scipy.spatial import distance

def create_data_matrix(n,m):
    mean = np.zeros(m)
    cov = np.eye(m, dtype=float)
    data_matrix = np.random.multivariate_normal(mean,cov,n)
    return(data_matrix)
def create_full_distance(A):
    distance_matrix = np.triu(distance.squareform(distance.pdist(A,"euclidean")),0)
    return(distance_matrix)

matrix_a = create_data_matrix(1000,2)
distance_from_numpy = create_full_distance(matrix_a)
matrix_b = np.empty((1000,1000))
for idx, line in enumerate(matrix_a):
    for j, line2 in enumerate(matrix_a):
        matrix_b[idx][j] = distance.euclidean(matrix_a[idx],matrix_a[j])

Now the matrices "distance_from_numpy" and "matrix_b" are the same, but matrix_b takes far longer to calculate, although matrix_a is only a (1000x2) matrix. I know that the "distance.pdist()" method is very fast, but I am not sure if I can implement it in an iteration process.
My question is: why is the double for loop so slow, and how can I increase the speed while still preserving the iteration process (since I want to include if-statements there)?

Edit, for context: I want to preserve the iteration because I'd like to stop the iteration if one of the distances is smaller than a specific number.
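One way to keep the early-stop check while avoiding the inner Python loop (a sketch, not part of the original post; the names `rowwise_distances` and `threshold` are illustrative) is to compute the distance matrix one row at a time, so only n Python iterations remain and the threshold can be tested after each row:

```python
import numpy as np

def rowwise_distances(A, threshold):
    # Fill the distance matrix row by row; each row is fully vectorized,
    # so Python-level iteration happens n times instead of n*n times.
    n = len(A)
    D = np.zeros((n, n))
    for i in range(n):
        row = np.linalg.norm(A - A[i], axis=1)  # distances from point i to all points
        D[i] = row
        # Early exit: stop as soon as any off-diagonal distance is below the threshold
        if np.any(row[np.arange(n) != i] < threshold):
            return D, i
    return D, None

A = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
D, stopped_at = rowwise_distances(A, threshold=0.5)
```

Here no pair is closer than 0.5, so the loop runs to completion and `stopped_at` is `None`; with a larger threshold it would return the index of the row that triggered the stop.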

James
  • How large is your matrix going to be? If it is 100x2, it might be faster to use np.linalg.norm to calculate all distances and then check for the threshold. – Ehsan Jun 25 '20 at 12:18
  • The data matrix right now is 100x2, but at some point I will have a (100000x1000) data matrix, so the corresponding distance matrices will be 100x100 and 100000x100000. – James Jun 25 '20 at 12:22
  • This is quite large. The call to `distance.euclidean` might take long enough that the Python for-loops don't matter anymore. I suggest you first compare the timing of `distance_from_numpy` vs `matrix_b` on something like (100x1000). – Feodoran Jun 25 '20 at 12:28
  • Well, in the example above the difference in calculation speed is noticeable: the NumPy-created full distance matrix "distance_from_numpy" is calculated instantly, while the for-loop-created distance matrix "matrix_b" takes a second or two, and both are only 1000x1000. – James Jun 25 '20 at 12:40
  • This is quite large. Try numba and calculate the distance by hand yourself. – Ehsan Jun 25 '20 at 12:41
  • OK, thank you very much. I tried numba out and it significantly increased computation speed. I tried it on the 1000x1000 distance matrix above, so I am not sure yet how 100000x100000 matrices will do, though. – James Jun 25 '20 at 13:13
  • For larger matrices you can try an approximation https://stackoverflow.com/a/53380192/4045774 or https://stackoverflow.com/a/42994680/4045774, if you don't run out of memory. – max9111 Jun 25 '20 at 14:47
  • The double loop means you are calling your function 1000*1000 times. It's those many calls to a Python function that take up most of the time. – hpaulj Jun 25 '20 at 15:18

2 Answers


Python is a high-level language, and therefore its loops are inherently slow: the interpreter has to deal with a lot of per-iteration overhead, and this gets progressively worse as the number of nested loops increases. NumPy, on the other hand, pushes those loops down into fast compiled C (and some Fortran) code.

To speed up the Python implementation, you can for example implement the loop part with Cython, which translates your code to C and compiles it for faster execution. Other options are Numba, or writing the loops in Fortran.
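A sketch of what "pushing the loops into NumPy" could look like here (an illustration, not code from the answer): broadcasting computes all pairwise differences at once, so the double loop runs inside compiled code.

```python
import numpy as np

def full_distance_broadcast(A):
    # Pairwise Euclidean distances via broadcasting: the (n, m) array is
    # expanded against itself to an (n, n, m) difference tensor, and the
    # norm is taken over the last axis. No Python-level loop remains.
    diff = A[:, None, :] - A[None, :, :]       # shape (n, n, m)
    return np.sqrt((diff ** 2).sum(axis=-1))   # shape (n, n)

A = np.array([[0.0, 0.0], [3.0, 4.0]])
D = full_distance_broadcast(A)
```

Note that the intermediate (n, n, m) array costs memory; for large n, `scipy.spatial.distance.pdist` or a chunked loop is the more economical choice.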

Feodoran
  • Since NumPy uses compiled code, shouldn't I be able to convert the for loops into some kind of NumPy methods? I will look into Cython, thank you. – James Jun 25 '20 at 12:13
  • What do you mean by that? How would that be different from your `create_full_distance()` function? – Feodoran Jun 25 '20 at 12:16
  • Could I use a while loop instead of the if statement? Would that help? – James Jun 25 '20 at 12:16
  • Instead of checking after each calculated distance value d(i,j) whether it's smaller than a certain number, I could use a "while x > d(i,j): ..." loop. – James Jun 25 '20 at 12:24
  • I would have to see the actual code to know at what point exactly you want to put the while loop. But I think this will only help if you have a lot of cases that are skipped this way. If it is only 1% of the cases, it won't help much. – Feodoran Jun 25 '20 at 12:31

As Ehsan mentioned in a comment, I used numba to increase computational speed.

from numba import jit
import numpy as np
import time
from scipy.spatial import distance

def create_data_matrix(n,m):
    mean = np.zeros(m)
    cov = np.eye(m, dtype=float)
    data_matrix = np.random.multivariate_normal(mean,cov,n)
    return(data_matrix)
    
def create_full_distance(A):
    distance_matrix = np.triu(distance.squareform(distance.pdist(A,"euclidean")),0)
    return(distance_matrix)

@jit(nopython=True) # Set "nopython" mode for best performance, equivalent to @njit
def slow_loop(matrix_a):
    n = len(matrix_a)
    matrix_b = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            # distance.euclidean() is a scipy call and is not supported in nopython mode
            matrix_b[i][j] = np.linalg.norm(matrix_a[i] - matrix_a[j])
    return matrix_b

def slow_loop_without_numba(matrix_a):
    n = len(matrix_a)
    matrix_b = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            matrix_b[i][j] = np.linalg.norm(matrix_a[i] - matrix_a[j])
    return matrix_b
    
matrix_a = create_data_matrix(1000,2)

start = time.time()
ergebnis = create_full_distance(matrix_a)
#print("ergebnis: ",ergebnis)  
end = time.time()
print("with scipy.distance.pdist = %s" % (end - start))

start2 = time.time() 
slow_loop(matrix_a)
end2 = time.time()
print("with @jit onto np.linalg.norm = %s" % (end2 - start2))

start3 = time.time()
slow_loop_without_numba(matrix_a)
end3 = time.time()
print("slow_loop without numba = %s" % (end3 - start3))

I executed the code, and it yielded these results:

with scipy.distance.pdist = 0.021986722946166992
with @jit onto np.linalg.norm = 0.8565070629119873
slow_loop without numba = 6.818004846572876

So numba increased the computational speed by a lot, although scipy is still much faster. This will become more interesting the bigger the distance matrices get. I couldn't use numba on a function that calls scipy methods.

James
  • For Numba, just write out the loop without using np.linalg.norm, e.g. https://stackoverflow.com/a/60854278/4045774. Also measure the second call and not the first one to get the runtime (the first run is compilation time + runtime). – max9111 Jun 25 '20 at 14:36