1

I have a large amount of data (over 10^5 points). I am searching a fast algorithm which finds all points in the dataset, which lie in a circle given by its center point and radius.

I thought about using an kd-tree to calculate for example the 10 nearest points to the circle's center, and then check if they are inside the circle. But I am not sure if this is the correct way.

loop_
  • 21
  • 5
  • 1
    One tip is not to take the square root... leave the distance as the sum of 2 squares but compare against the square of the radius. – Mark Setchell Jan 29 '20 at 09:10
  • I would start with NumPy arrays (so C under the hood) but if the algorithm must be in Python, look at [numba](https://numba.pydata.org/) in order to speed it up further. Working on the GPU might also be an option here, see for example [pygpu](https://fileadmin.cs.lth.se/cs/Personal/Calle_Lejdfors/pygpu/). 10E5 points is not really large. That should take only fractions of a second. – NoDataDumpNoContribution Jan 29 '20 at 09:41

5 Answers5

4

I benchmarked a numexpr version against a simple Numpy implementation as follows:

#!/usr/bin/env python3

import numpy as np
import numexpr as ne

# Ensure repeatable, deterministic randomness!
np.random.seed(42)

# Generate test arrays
N = 1000000
X = np.random.rand(N)
Y = np.random.rand(N)

# Define centre and radius
cx = cy = r = 0.5

def method1(X,Y,cx,cy,r):
    """Straight Numpy determination of points in circle"""
    d = (X-cx)**2 + (Y-cy)**2
    res = d < r**2
    return res

def method2(X,Y,cx,cy,r):
    """Numexpr determination of points in circle"""
    res = ne.evaluate('((X-cx)**2 + (Y-cy)**2)<r**2')
    return res

def method3(data,a,b,r): 
    """List based determination of points in circle, with pre-filtering using a square"""
    in_square_points = [(x,y) for (x,y) in data if a-r < x < a+r and b-r < y < b+r] 
    in_circle_points = [(x,y) for (x,y) in in_square_points if (x-a)**2 + (y-b)**2 < r**2] 
    return in_circle_points

# Timing
%timeit method1(X,Y,cx,cy,r)

%timeit method2(X,Y,cx,cy,r)

# Massage input data (before timing) to match agorithm
data=[(x,y) for x,y in zip(X,Y)]
%timeit method3(data,cx,cy,r)

I then timed it in IPython as follows:

%timeit method1(X,Y,cx,cy,r)                                                                                                                     
6.68 ms ± 246 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit method2(X,Y,cx,cy,r)                                                                                                                     
743 µs ± 17.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit method3(data,cx,cy,r)
1.11 s ± 9.81 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So the numexpr version came out 9x faster. As the points lie in the range [0..1], the algorithm is effectively calculating pi and the two methods come out the same:

method1(X,Y,cx,cy,r).sum()                                                                                                                       
784973

method2(X,Y,cx,cy,r).sum()                                                                                                                       
784973

len(method3(data,cx,cy,r))                                                                                                                       
784973

4 * 784973 / N
3.139

Note: I should point out that numexpr multi-threads your code across multiple CPU cores for you, automatically. If you feel like experimenting, with the number of threads, you can change it dynamically before calling method2(), or even inside there, with:

# Split calculations across 6 threads
ne.set_num_threads(6) 

Anyone else wishing to test the speed of their method is welcome to use my code as a benchmarking framework.

Mark Setchell
  • 191,897
  • 31
  • 273
  • 432
2

A space-partitioning data structure like a k-d tree or quadtree could answer your query exactly; no need for a heuristic like "10 nearest points".

You can recursively search the tree:

  1. If the current node's bounding rectangle does not intersect the circle at all, ignore it.
  2. Otherwise, if the current node's bounding rectangle is completely contained within the circle, return all points within it.
  3. Otherwise, if the current node's bounding rectangle partially overlaps with the circle:
    • If the current node is a leaf, test each point individually to see if it's within the circle, and return the ones which are.
    • Otherwise, recurse on the current node's children, and return all of the points returned by the recursive calls.

For step 2, the rectangle is fully contained in the circle if and only if its four corners are. For steps 1 and 3, you can either test whether the rectangle intersects the circle's bounding box, and conservatively recurse because the rectangle might intersect the circle; or you can do an exact (but slightly more complicated) check between the rectangle and the actual circle, saving a few unnecessary recursive calls.

Since this is Python, instead of returning the points in a collection and then concatenating/unioning the results of recursive calls, you could yield the points from a recursive generator function. That simplifies the implementation somewhat.

kaya3
  • 47,440
  • 4
  • 68
  • 97
  • I find this idea quite interesting. I will try to implement it and hope that I can post my code if there is any trouble with it. – loop_ Jan 29 '20 at 09:33
  • 1
    @loop_ It is worth noting that a KD-Tree creation complexity is `O(n*log(n))` and that a simple search like the one proposed by others is `O(n)` only. If, for this dataset, you only want to perform 1 search, then it is not efficient to build a KD-Tree. However, if you need to perform many search within the same dataset (more that `k*log(n)`, where `k` is a factor depending on the KD-Tree creation algorithm), then it is worth using a KD-Tree because the creation complexity will be amortized with time. – Gilles-Philippe Paillé Jan 29 '20 at 14:11
  • Assuming you are doing many queries, I'd still recommend benchmarking against the solution Mark Setchell posted - specifically because this is Python, so a theoretically efficient data structure written in pure Python can easily lose to a simple one implemented in C (as `numpy` is) even for n ~ 10^5 or more. – kaya3 Jan 29 '20 at 14:21
1

To check whether a point (a, b) is within a circle of center (x, y) and radius r, then you can simply do a computation:

within_circle = ((x-a)**2 + (y-b)**2) <= r*r)

This equation uses the property of the circle on which can get the absolute distance to a point (which is also used in the distance formula if you noticed).

Seraph Wedd
  • 864
  • 6
  • 14
1

If you want first to filter a large amount of your dataset without huge computations, you can use the Square of size (2r x 2r) with the same center as the circle (where r is the circle's radius).

Have a look at this picture : Square outside circle

If you have the center's coordinates (a,b) and r the radius, then the points (x,y) inside the square verify :

in_square_points = [(x,y) for (x,y) in data if a-r < x < a+r and b-r < y < b+r]

And finally after this filter you can apply the circle equation :

in_circle_points = [(x,y) for (x,y) in in_square_points if (x-a)**2 + (y-b)**2 < r**2]

** EDIT **

if your input is structured like this :

data = [
    [13, 45],
    [-1, 2],
    ...
    [60, -4]
]

Then you can try, if you prefer common for-loops :

in_square_points = []
for i in range(len(data)):
    x = data[i][0]
    y = data[i][1]
    if a-r < x < a+r and b-r < y < b+r:
        in_square_points.append([x, y])
print(in_square_points)
Phoenixo
  • 2,071
  • 1
  • 6
  • 13
  • Thank you, I will also try this approach. Since I am not used to the short notation of the loops, is there a way to also get the index of the (x,y)? – loop_ Jan 29 '20 at 09:58
  • I updated my answer to rewrite the algorithm with a common for-loop, tell me if it's still not what you need – Phoenixo Jan 29 '20 at 10:26
  • This is what I actually just implemented. And it works pretty good. Thank you very much! – loop_ Jan 29 '20 at 11:24
1

If you are only interested in the number of points which are in the circle you can try Numba.

import numpy as np
import numba as nb
import numexpr as ne

def method2(X,Y,cx,cy,r):
    """Numexpr method"""
    res = ne.evaluate('((X-cx)**2 + (Y-cy)**2) < r**2')
    return res

@nb.njit(fastmath=True,parallel=True)
def method3(X,Y,cx,cy,r):
    acc=0
    for i in nb.prange(X.shape[0]):
        if ((X[i]-cx)**2 + (Y[i]-cy)**2) < r**2:
            acc+=1
    return acc

Timings

# Ensure repeatable, deterministic randomness!
np.random.seed(42)

# Generate test arrays
N = 1000000
X = np.random.rand(N)
Y = np.random.rand(N)

# Define centre and radius
cx = cy = r = 0.5

#@Mark Setchell
%timeit method2(X,Y,cx,cy,r)
#825 µs ± 22.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit method3(X,Y,cx,cy,r)
#480 µs ± 94.4 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
max9111
  • 6,272
  • 1
  • 16
  • 33