
I am trying to write efficient Python code that, given a set of data points of the form

a1 a2 a3 ... an (point 1)
b1 b2 b3 ... bn (point 2)
         .
         .
         .

will find which of these are too close (within a threshold value).

I already have the following Fortran routine that solves this problem:

program check_closeness                                           
    implicit none
    integer, parameter :: npoints = 10000, ndis = 20
    real(8), parameter :: r = 1e0, maxv = 3e0, minv = 0e0
    real(8), dimension(npoints,ndis) :: dis
    real(8) :: start, ende

    call RANDOM_NUMBER(dis)
    dis = (maxv - minv) * dis + minv

    call cpu_time(start)
    call remove_close(npoints, ndis,dis, r)
    call cpu_time(ende)
    write(*,*) 'Time elapsed', ende - start
endprogram

subroutine remove_close(npoints, ndis, points, r)
    implicit none
    integer, intent(in) :: npoints, ndis
    integer :: i, i_check
    real(8), dimension(npoints, ndis), intent(in) :: points
    real(8), intent(in) :: r
    logical :: is_close

    do i=1,npoints
       is_close = .FALSE.
       i_check = i

       do while (.not. is_close .and. i_check < npoints)
          i_check = i_check + 1
          is_close = all(abs(points(i,:) - points(i_check,:)) < r)
       enddo
    enddo
end subroutine

On my computer this execution takes: Time elapsed 1.0651889999999999 (seconds).

Now, I write the same exact code (or so I think) in Python:

import numpy as np
import time

def check_close(a, r):
    npoints = a.shape[0]
    for i, vec1 in enumerate(a):
        is_close = False
        icheck = i

        while (not is_close and icheck < npoints - 1):
            icheck += 1
            vec2 = a[icheck,:]
            is_close = all(np.abs(vec1 - vec2) < r)

maxv, minv = 3, 0
a = np.random.rand(10000, 20)
a = (maxv - minv) * a + minv
r = 1e0

ini = time.time()
check_close(a, r)
fin = time.time()
print('Time elapsed {}'.format(fin - ini))

This execution takes Time elapsed 102.60617995262146 (seconds), which is considerably slower than the Fortran one.

I tried yet another Python version of this routine which is considerably faster, but still does not approach the Fortran one:

import numpy as np                        
import time

def check_close(a, r):
    for i, vec1 in enumerate(a[:-1]):
        d = np.abs(a[i+1:,:] - vec1)
        is_close = np.any(np.all(d < r, axis=1))

maxv, minv = 3, 0
a = np.random.rand(10000, 20)
a = (maxv - minv) * a + minv
r = 1e0

ini = time.time()
check_close(a, r)
fin = time.time()
print('Time elapsed {}'.format(fin - ini))

In this case the execution takes Time elapsed 3.4987785816192627 (seconds). From this, I guess the improvement comes from the vectorization and from removing the while loop. On the other hand, this implementation does not benefit from stopping the search as soon as a close point is found.
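One way I can think of to partially keep the early exit while staying vectorized is to compare against the remaining rows in blocks, stopping for a point as soon as one block contains a close partner. A sketch (the function name `check_close_chunked` and the chunk size are arbitrary, and I have not benchmarked it):

import numpy as np

def check_close_chunked(a, r, chunk=256):
    npoints = a.shape[0]
    for i, vec1 in enumerate(a[:-1]):
        # Compare against the remaining points block by block so we can
        # stop as soon as one block contains a close point.
        for start in range(i + 1, npoints, chunk):
            d = np.abs(a[start:start + chunk, :] - vec1)
            if np.any(np.all(d < r, axis=1)):
                break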

My questions are:

  1. What is it that takes so much time in the Python implementations?
  2. Is there any way I could rewrite the Python code to make it (almost) as fast as the Fortran one?

[Edit]

I was asked to use time.perf_counter() to measure execution times in the Python codes.

Now I use:

ini = time.perf_counter()
check_close(a, r)
fin = time.perf_counter()
print('Time elapsed {}'.format(fin - ini))

to measure them and the updated times are:

Python implementation 1: Time elapsed 98.63728801719844 (seconds)

Python implementation 2: Time elapsed 3.3923211600631475 (seconds)

  • Don't use `time.time` to benchmark code as it measures wall time (i.e. it includes everything that runs on your system). Either use [`time.perf_counter`](https://docs.python.org/3/library/time.html#time.perf_counter) or [`process_time`](https://docs.python.org/3/library/time.html#time.process_time). Then please update your question with the new numbers. – a_guest Feb 18 '21 at 11:01
  • @a_guest, I don't see how getting a 10% better elapsed time is going to help at all – pytness Feb 18 '21 at 11:11
  • Note that this may not be an especially efficient way to do this. First put the points in bins of width r. Then the only points that can be close are in the same bin as the reference point, or in neighbouring bins. The resulting algorithm will be O(N) rather than the O(N**2) you have here. For large N this may make the code orders of magnitude faster than you have here, irrespective of your choice of language. I am using may as I have only used such methods in 3D, and you have a somewhat higher dimensionality. – Ian Bush Feb 18 '21 at 11:14 (a sketch of this idea follows the comments below)
  • @pytness I don't understand your comment. Numbers reported via `time.time` are not reliable hence I asked to update these numbers. – a_guest Feb 18 '21 at 11:15
  • Also note real(8) is not portable in Fortran and will fail for certain compilers. Learn about the intrinsic iso_fortran_env module and see https://stackoverflow.com/questions/838310/fortran-90-kind-parameter – Ian Bush Feb 18 '21 at 11:16
  • I assume that you use np.abs() to determine the distance between data points, is that correct? I think that use is incorrect. In that case you are looking for the 'norm' function in numpy's linear algebra module. – Stijn De Pauw Feb 18 '21 at 11:31
  • @a_guest I updated the code as you asked, and have edited the entry to show the new execution times. – pablo Feb 18 '21 at 11:49
  • @StijnDePauw, I am concerned about that. I'd like to focus more on what is happening in Python to make the same implementation so slow. – pablo Feb 18 '21 at 11:50
  • So I played around a bit with np.prod and it seems like your numpy implementation is probably close to optimal. The issue is that numpy doesn’t work with iteration, so one would have to preallocate a ridiculous amount of memory. You could use numba to jit compile your function, which speeds up loops to near native (there are also parallel iterators) or wrap your fortran code. – AlexNe Feb 18 '21 at 12:02
  • @IanBush Could you provide me more information about that algorithm? If it has any name it is enough, I will search it. Thank you very much! – pablo Feb 18 '21 at 17:28
  • @pablo https://en.wikipedia.org/wiki/Cell_lists covers it in 3D – Ian Bush Feb 18 '21 at 17:35
  • This might be useful: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise_distances_chunked.html – moooeeeep Feb 19 '21 at 12:33
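Following up on the cell-list suggestion in the comments: a minimal sketch of the binning idea, simplified here to bin on a single coordinate (the full d-dimensional cell list described in the comment becomes impractical for 20 dimensions because the number of neighbouring cells explodes). The names `close_pairs_binned` and `is_close` are illustrative, not part of the original code:

import numpy as np
from collections import defaultdict

def close_pairs_binned(a, r):
    """Return index pairs (i, j), i < j, whose coordinates all differ by less than r."""
    def is_close(i, j):
        return np.all(np.abs(a[i] - a[j]) < r)

    # Bin point indices by their first coordinate; a close pair can only
    # sit in the same bin or in adjacent bins along that axis.
    bins = defaultdict(list)
    for idx, x in enumerate(a[:, 0]):
        bins[int(x // r)].append(idx)

    pairs = []
    for b, members in bins.items():
        # Pairs within the same bin (members are already in increasing index order).
        for k, i in enumerate(members):
            for j in members[k + 1:]:
                if is_close(i, j):
                    pairs.append((i, j))
        # Pairs between this bin and the next one; pairs with the previous
        # bin are covered when that bin is processed.
        for i in members:
            for j in bins.get(b + 1, []):
                if is_close(i, j):
                    pairs.append((min(i, j), max(i, j)))
    return pairs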

3 Answers


NumPy was created specifically BECAUSE pure Python is incredibly slow. As you showed, the Fortran code is much faster without many algorithmic differences. So, to answer your questions:

  1. Python is a scripting language and was not built around speed. If you need speed, use libraries such as numpy, like you are doing, or try to incorporate native C/C++ code into your program where you need it.
  2. To make it as fast as fortran, try to find a way to only use numpy routines, getting rid of the for loop. This will drastically increase your performance.

Try to think about your problems and how you can solve them without loops; this will be the key to unlocking speed in Python.
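For illustration only, a completely loop-free version of the question's check could use broadcasting, as sketched below (the function name `check_close_broadcast` is made up here; note that it materialises an n × n × d array, so it only fits in memory for modest n):

import numpy as np

def check_close_broadcast(a, r):
    # Pairwise absolute differences for all point pairs: shape (n, n, d).
    diff = np.abs(a[:, None, :] - a[None, :, :])
    # True where every coordinate of a pair differs by less than r.
    close_matrix = np.all(diff < r, axis=-1)
    np.fill_diagonal(close_matrix, False)  # ignore self-comparisons
    # Per-point flag: does this point have any close partner?
    return np.any(close_matrix, axis=1)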

By the way, I am not sure what you intend with your use of np.abs(), but it is not calculating the Euclidean distance between two points. Use np.linalg.norm() for that.
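For instance, reusing the loop variables from the question's second snippet, a Euclidean version of the test could look like this (a sketch, not what the original code computes):

dists = np.linalg.norm(a[i+1:, :] - vec1, axis=1)  # Euclidean distance to each remaining point
is_close = np.any(dists < r)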

  • Similarly in Fortran norm2 calculates the 2-norm of an array. But if you really want performance you'll avoid the square root and compare the square of the lengths – Ian Bush Feb 18 '21 at 11:51
  • Not sure where you got the Euclidean distance from. The OP apparently implements the Chebyshev distance, which is probably not relevant for either the question or the answer. – moooeeeep Feb 19 '21 at 12:03

You can do better in Python, but it takes some work. Surprisingly I think your approach using only NumPy is about as good as you can get without involving other modules. However, if you are willing to use numba you can do better than even Fortran.

The function below returns the close vectors. It is effectively C code written in Python that is then compiled to native code using numba.

import numpy as np
from math import fabs
from numba import njit

@njit
def check_close_nb2(a, r):
    n = a.shape[0]
    m = a.shape[1]
    # Selection array if you want to use it later
    close = np.zeros(n, dtype=np.bool_)
    for i in range(n):
        for j in range(i + 1, n):
            for k in range(m):
                # Short-circuit comparison: stop as soon as one coordinate differs by >= r
                last = fabs(a[i, k] - a[j, k]) >= r
                if last:
                    break
            # If the inner loop finished without breaking, every coordinate is within r
            if k == (m - 1) and not last:
                close[i] = 1
                # Once a close point is found, stop looking at further j
                break
    return close

When I run this on my machine using %timeit I get

%timeit check_close_nb2(a, r)
544 ms ± 4.07 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

When I run your code above, I see

%timeit check_close(a, r)
4.91 s ± 69.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

so my comparison suggests that my machine is a bit slower than yours, which would put the numba version at around 400 ms on your machine.


I have tried two methods, both under 3 seconds on my machine, but neither is yet as fast as the Fortran version.

I generate similar random data, but increased r because otherwise I got zero close cases:

import time

import numpy as np

maxv, minv = 3, 0

a = np.random.rand(10000, 20)
a = (maxv - minv) * a + minv
r = 3e0

1 - BallTree (sklearn)

from sklearn.neighbors import BallTree

ini = time.time()
tree = BallTree(a, leaf_size=5000)

within = tree.query_radius(a, r=r)

fin = time.time()

print('Balltree (sklearn) based - time elapsed {}'.format(fin - ini))

Which gave

Balltree (sklearn) based - time elapsed 1.5845096111297607

Edit: weirdly (can someone explain?), an insanely high leaf_size=5000 worked best in my case. Normally the leaf size is around 10.

2 - pdist (scipy)

from scipy.spatial.distance import pdist


ini = time.time()

condensed_distance_small = pdist(a) <= r
condensed_indici = np.argwhere(condensed_distance_small)

n = 10000

# i, j contain the indices of points that are within r of each other,
# based on the condensed-index formulas from
# https://stackoverflow.com/questions/13079563/how-does-condensed-distance-matrix-work-pdist

i = np.ceil((1/2.) * (- (-8*condensed_indici + 4 *n**2 -4*n - 7)**0.5 + 2*n -1) - 1).astype(int)
elem_in_i_rows = (i+1) * (n - 1 - (i+1)) + ((i+ 1)*(i + 2))//2
j = (n - elem_in_i_rows + condensed_indici)

fin = time.time()

print('pdist (scipy) based - time elapsed {}'.format(fin - ini))

Which gave

pdist (scipy) based - time elapsed 0.6769788265228271

pdist returns the distances in condensed form, so there is some (i, j) index bookkeeping to be done, but this is done with vectorized operations, which is fast.
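As Kevin S notes in the comments, the original criterion corresponds to the Chebyshev metric rather than the Euclidean default. With that metric, the index bookkeeping can also be done with np.triu_indices instead of the closed-form arithmetic above. A sketch, reusing the `a` and `r` defined earlier (it allocates two index arrays of length n*(n-1)/2, so it trades memory for simplicity):

import numpy as np
from scipy.spatial.distance import pdist

n = a.shape[0]
# Chebyshev distance < r is equivalent to all coordinates differing by less than r.
close = pdist(a, metric='chebyshev') < r
# np.triu_indices(n, k=1) enumerates pairs in the same order as pdist's condensed form.
i_all, j_all = np.triu_indices(n, k=1)
i, j = i_all[close], j_all[close]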

  • Thank you very much. Notice I have just updated the Fortran time; I made an error, so your second method is actually better in time. The problem with it is memory: for my typical problems I will have to load about 1e5 data points, so the distance matrix cannot fit in memory. – pablo Feb 19 '21 at 09:21
  • @pablo You should mention that constraint (_the distance matrix doesn't fit in memory_) in the question post. – moooeeeep Feb 19 '21 at 12:05
  • pdist isn't storing the full distance matrix, but about half of it; that might indeed be just the point where memory problems occur. BallTree and pdist support multiple distance metrics, which is also good to know; I made a (wrong?) assumption about the metric used. – Willem Hendriks Feb 19 '21 at 12:31
  • The pdist solution does not give the same answer as the original post's `check_close` for close points. Do you need to change the metric? It looks like it should be `"chebyshev"`. This also speeds up the solution a good bit (670 ms -> 400 ms) on my machine and produces the correct answer (after `np.unique` to deduplicate `i`). – Kevin S Mar 04 '21 at 12:14