6

N.B.: there's a major edit at the bottom of the question - check it out.

Question

Say I have a set of points:

[Image: a set of points; the middle point's surrounding region is shown in green]

I want to find the point with the most points surrounding it, within radius R of that point (i.e. a circle) or within ±R (i.e. a square), in 2 dimensions. I'll refer to this as the densest point function.

For the diagrams in this question, I'll represent the surrounding region as a circle. In the image above, the middle point's surrounding region is shown in green. Of all the points, this middle point has the most surrounding points within radius R, so it would be returned by the densest point function.
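
To pin down exactly what I mean, here is a naive reference implementation of the densest point function (a Python sketch of my own; the names are arbitrary). It's O(n^2), which is precisely what I'm trying to avoid:

def densest_point_naive(points, R):
    # Naive O(n^2) reference for the "densest point function", circular
    # neighbourhood of radius R. For the square (+-R) variant, replace the
    # distance test with abs(px - qx) <= R and abs(py - qy) <= R.
    best, best_count = None, -1
    for px, py in points:
        count = sum((px - qx) ** 2 + (py - qy) ** 2 <= R * R
                    for qx, qy in points) - 1   # don't count the point itself
        if count > best_count:
            best, best_count = (px, py), count
    return best, best_count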

What I've tried

A viable way to solve this problem would be to use a range searching solution; this answer explains further, and states that range searching has "O(log(n) + k) worst-case time". Using this, I could get the number of points surrounding each point and choose the point with the largest surrounding-point count.

However, if the points were extremely densely packed (in the order of a million), as such:

[Image: an extremely dense cluster of ~1e6 points]

then each of these million points (1e6) would need to have a range search performed on it. The worst-case query times, where k is the number of points returned in the range, are:

  • O(sqrt(n) + k) for kd-trees in two dimensions,
  • O(log(n) + k) for 2d-range trees,
  • O(n) for quadtrees

So, for a group of 1e6 points that are all within radius R of one another, this gives a complexity of O(log(1e6) + 1e6) for each point. Over all 1e6 points, that yields over a trillion operations!

Any ideas on a more efficient, precise way of achieving this, so that I could find the point with the most surrounding points for a group of points, and in a reasonable time (preferably O(n log(n)) or less)?

EDIT

Turns out that the method above is correct! I just need help implementing it.

(Semi-)Solution

If I use a 2d-range tree:

  • A range reporting query costs O(log^d(n) + k), for k returned points,
  • For a range tree with fractional cascading (also known as a layered range tree), the complexity is O(log^(d-1)(n) + k),
  • For 2 dimensions, that is O(log(n) + k),
  • Furthermore, if I perform a range counting query (i.e., I do not report each point), then it costs O(log(n)).

I'd perform this on every point - yielding the O(n log(n)) complexity I desired!

Problem

However, I cannot figure out how to write the code for a counting query for a 2d layered range tree.

I've found a great resource (from page 113 onwards) about range trees, including 2d-range tree pseudocode. But I can't figure out how to introduce fractional cascading, nor how to correctly implement the counting query so that it has O(log n) complexity.

I've also found two range tree implementations here and here in Java, and one in C++ here, although I'm not sure the latter uses fractional cascading, as it states above the countInRange method that:

It returns the number of such points in worst case O(log(n)^d) time. It can also return the points that are in the rectangle in worst case O(log(n)^d + k) time where k is the number of points that lie in the rectangle.

which suggests to me it does not apply fractional cascading.
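
For concreteness, here is my understanding of the kind of structure those implementations describe, without fractional cascading - a "merge sort tree" whose counting query costs O(log(n)^2). This is a Python sketch of my own (not taken from those implementations), just to make the baseline explicit:

import bisect

class MergeSortTree:
    # Counting range tree without fractional cascading:
    # O(n log n) space, O(log(n)^2) per counting query.
    def __init__(self, points):
        pts = sorted(points)                  # sort by X (ties broken by Y)
        self.n = len(pts)
        self.xs = [p[0] for p in pts]
        # Bottom-up segment tree over the X-order; each node stores the
        # sorted Y-values of the points it covers.
        self.ys = [[] for _ in range(2 * self.n)]
        for i, (_, y) in enumerate(pts):
            self.ys[self.n + i] = [y]
        for i in range(self.n - 1, 0, -1):
            self.ys[i] = sorted(self.ys[2 * i] + self.ys[2 * i + 1])

    def count(self, x1, x2, y1, y2):
        # Number of points with x1 <= x <= x2 and y1 <= y <= y2.
        lo = self.n + bisect.bisect_left(self.xs, x1)
        hi = self.n + bisect.bisect_right(self.xs, x2)
        total = 0
        while lo < hi:
            if lo & 1:
                total += (bisect.bisect_right(self.ys[lo], y2)
                          - bisect.bisect_left(self.ys[lo], y1))
                lo += 1
            if hi & 1:
                hi -= 1
                total += (bisect.bisect_right(self.ys[hi], y2)
                          - bisect.bisect_left(self.ys[hi], y1))
            lo //= 2
            hi //= 2
        return total

Calling count(x-R, x+R, y-R, y+R) once per point gives O(n log(n)^2) overall; fractional cascading would replace the inner binary searches with pointer lookups to reach the O(n log(n)) I'm after.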

Refined question

To answer the question above, therefore, all I need to know is whether there are any libraries offering 2d-range trees with fractional cascading and a range counting query of complexity O(log(n)), so I don't go reinventing any wheels - or, failing that, whether you can help me write/modify the resources above to perform a query of that complexity?

I'm also not complaining if you can provide any other method that achieves a range counting query over 2d points in O(log(n))!

Nick Bull
  • Do you need an exact answer, or is an approximate answer (to within some epsilon) alright? (Hint: an approximate answer is probably okay.) – Richard Jun 05 '17 at 00:01
  • It needs to be precise in this instance. I've already provided how to achieve the approximate answer within the question - simply bucketing by a proportion of the radius would give me a way to do so within a percentage of R. – Nick Bull Jun 05 '17 at 00:07
  • Do you know anything about your data, such as whether it is distributed homogeneously, clumps, &c? – Richard Jun 05 '17 at 01:55
  • In your example pictures the grid has a radius of r not r/2. If you take r/2 for the buckets then you know that at least the points in the same bucket have a distance shorter than r. With r this is not guaranteed (diagonal is sqrt(2)*r which is > r). I would use squares, it's easier and faster to evaluate than circles. – maraca Jun 05 '17 at 02:10
  • @maraca I agree! Also, it's totally random what distribution the points have. – Nick Bull Jun 05 '17 at 02:18
  • @NickBull: With regards to your first answer, I trust that you do want an exact answer. But I do not believe that it is what you need. You are talking about _many_ points in what appears to be a floating-point situation. I can create a set-up wherein two points are differentiated by only the smallest possible floating-point increment whose neighbours differ in population by 1 out of a million. It seems inconceivable to me that there's a meaningful difference between these in terms of application. Think about it. That said: I'll try to find an exact answer. – Richard Jun 05 '17 at 03:32
  • @NickBull: When you say "it's totally random what distribution the points have" do you mean to say that "the distribution of the points is unknown and may be anything" or "the points have a uniform random distribution on the plane"? I assume you mean the former, but I'd like to be sure. – Richard Jun 05 '17 at 03:48
  • @Richard lol, random was the worst possible word to use, it was 4am when I wrote that comment - the distribution is unknown. The problem's importance to me mostly derives from the middle point being the "target" (i.e., not returning this point is downright wrong), without being able to necessarily just take an "average" of the points, as it'd still be skewed by the poorly distributed points (such as the distribution in image #2). I do understand what you mean though! It's vital the point returned is the one that touches the most nodes, however – Nick Bull Jun 05 '17 at 09:45
  • @NickBull: I'd suggest un-editing your question, accepting an answer if there's a good one, or, if not, submitting your own answer. Then inquire about the library using a **new** question. As it is, you've made a moving target, which makes it more difficult for folks to help you. – Richard Jun 08 '17 at 00:16
  • @Richard noted, I know a lot of discussion and answers have been provided for the original question, but I think because the original information and images are still relevant to the motivation behind the question (as well as no answer providing enough of an answer for me to honestly accept one), I'd rather keep the information above it and add new knowledge about the problem as an edit than repeat it in a new question. – Nick Bull Jun 08 '17 at 00:19
  • @NickBull: Then copy and paste the content to a new question with appropriate modification. The question, as stated, invalidates all existing answers; this isn't a good way to make use of the site. – Richard Jun 08 '17 at 00:20
  • @Richard Why are they now irrelevant? The question is still the same - find the greatest point density. Just now I've provided some knowledge I've found, which is that this can be solved with a 2d layered range tree with an efficient counting algorithm. I've just provided some information to you guys to help answer the problem! – Nick Bull Jun 08 '17 at 00:23
  • @NickBull: Because you are now asking for a library, not a strategy or algorithm. That's a fundamental change. Yola wouldn't have suggested a quadtree under this new question, I wouldn't have suggested a blocking heuristic, mcdowella's wouldn't have suggested their clever algorithm. – Richard Jun 08 '17 at 00:28
  • @Richard Apologies you're right, clarified the edit a bit. I don't mind any other algorithm that can provide me `O(log n)` range counting queries either! By library I meant in place of an algorithm, if there are well known libraries with 2d layered range trees with range counting queries, so I don't rewrite something that's already written and maintained by an authority – Nick Bull Jun 08 '17 at 00:30
  • Just a general note: range trees in their typical form support only axis-parallel rectangular range queries, not circular queries – Niklas B. Jun 09 '17 at 09:51

3 Answers

2

I suggest using a plane sweep algorithm. This allows one-dimensional range queries instead of 2-d queries, which is more efficient and simpler, and in the case of a square neighborhood does not require fractional cascading:

  1. Sort points by Y-coordinate to array S.
  2. Advance 3 pointers through array S: one (C) for the currently inspected (center) point; another one, A, running a little ahead of C, at the first point more than R above it; and the last one, B, running a little behind C, at the first point within R below it.
  3. Insert the points that A passes into an order statistic tree (ordered by X-coordinate) and remove the points that B passes from this tree, so that the tree always holds exactly the points within ±R of C vertically. Use this tree to find the points at distance R to the left/right of C, and use the difference of these points' positions (ranks) in the tree to get the number of points in the square area around C (see the sketch after this list).
  4. Use results of previous step to select "most surrounded" point.
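
Here is a minimal sketch of steps 1-4 for the square neighborhood (Python; sortedcontainers.SortedList is a third-party stand-in for an order statistic tree, giving O(log n) rank queries):

from sortedcontainers import SortedList

def densest_point_sweep(points, R):
    # Square (+-R) neighborhood; assumes a non-empty list of (x, y) tuples.
    pts = sorted(points, key=lambda p: p[1])          # step 1: sort by Y
    window = SortedList()                             # X-coords of points with |y - Cy| <= R
    a = b = 0                                         # the A and B pointers
    best, best_count = None, -1
    for cx, cy in pts:                                # C: the center point
        while a < len(pts) and pts[a][1] <= cy + R:   # insert points A passes
            window.add(pts[a][0]); a += 1
        while pts[b][1] < cy - R:                     # remove points B passes
            window.remove(pts[b][0]); b += 1
        # rank difference = number of points in the square (C itself included)
        count = window.bisect_right(cx + R) - window.bisect_left(cx - R)
        if count > best_count:
            best, best_count = (cx, cy), count
    return best, best_count - 1                       # exclude C itself

Each point is inserted, removed, and queried exactly once, so the whole thing is O(n log n).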

This algorithm could be optimized by rotating the points (or just exchanging the X and Y coordinates) so that the width of the occupied area is no larger than its height. You could also cut the points into vertical slices (with R-sized overlap) and process the slices separately, in case there are too many elements in the tree for it to fit in the CPU cache (which is unlikely for only 1 million points). This algorithm (optimized or not) has time complexity O(n log n).

For a circular neighborhood (if R is not too large and the points are evenly distributed) you could approximate the circle with several rectangles:

[Image: a circle approximated by several rectangles]

In this case, step 2 of the algorithm should use more pointers to allow insertion/removal to/from several trees. And in step 3 you should do a linear search near the points at the proper distance (≤ R) to distinguish points inside the circle from points outside it.

Another way to deal with a circular neighborhood is to approximate the circle with rectangles of equal height (though here the circle should be split into more pieces). This results in a much simpler algorithm, where sorted arrays are used instead of order statistic trees:

  1. Cut area occupied by points into horizontal slices, sort slices by Y, then sort points inside slices by X.
  2. For each point in each slice, assume it to be a "center" point and do step 3.
  3. For each nearby slice, use binary search to find the points with Euclidean distance close to R, then use linear search to tell "inside" points from "outside" ones. Stop the linear search where the slice is completely inside the circle, and count the remaining points by the difference of positions in the array (see the sketch after this list).
  4. Use results of previous step to select "most surrounded" point.
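
A sketch of step 3 for one nearby slice (Python; the names are mine, and in the real algorithm the X-array for each slice would of course be built once, not per call):

import bisect
from math import sqrt

def count_in_slice(slice_pts, xs, cx, cy, R, y_far):
    # slice_pts: this slice's points sorted by X; xs: their X-coordinates;
    # y_far: the slice edge farther from cy. Counts the slice's points
    # inside the circle of radius R around (cx, cy).
    # Within this chord the slice is completely inside the circle, so
    # those points are counted purely by difference of positions:
    inner = sqrt(max(R * R - (y_far - cy) ** 2, 0.0))
    lo = bisect.bisect_left(xs, cx - inner)
    hi = bisect.bisect_right(xs, cx + inner)
    count = hi - lo
    # Linear scan over the ambiguous fringe between the chord and +-R:
    i = lo - 1
    while i >= 0 and xs[i] >= cx - R:
        x, y = slice_pts[i]
        count += (x - cx) ** 2 + (y - cy) ** 2 <= R * R
        i -= 1
    i = hi
    while i < len(xs) and xs[i] <= cx + R:
        x, y = slice_pts[i]
        count += (x - cx) ** 2 + (y - cy) ** 2 <= R * R
        i += 1
    return count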

This algorithm allows optimizations mentioned earlier as well as fractional cascading.

Evgeny Kluev
  • Some analysis of the time/space complexities involved would probably be a good idea, since OP seems quite interested in those. – Richard Jun 11 '17 at 17:59
  • For rectangular neighborhood O(n log n) worst case time complexity is obvious: n inserts, n deletions, n searches in binary tree. Circular neighborhood in case of small R is also pretty easy: small R allows to assume circle approximation with constant (and small) number of rectangles and relatively cheap linear searches, so complexity remains the same, only with larger constant factor. In general case it might be around O(n log(n) sqrt(r)) or O(n log(n) + n sqrt(r)) with fractional cascading (r = number of points in circle). Precise analysis seems too complicated, so I didn't do it, sorry. – Evgeny Kluev Jun 11 '17 at 18:54
  • On second thought the problem of worst-case time complexity for circular neighborhood is not so difficult: worst case data are points aligned in two parallel rows at distance slightly less than R and worst case complexity is O(n^2), which means "circular" algorithms work nice only for evenly distributed points. As for O(n) space complexity, it immediately follows from used data structures. – Evgeny Kluev Jun 11 '17 at 19:59
  • Does this not have a complexity of O(log n + k), for k reported points? I can't work out how a plane sweep algorithm could count in O(log n) time – Nick Bull Jun 14 '17 at 11:09
  • @NickBull: No, these algorithms only count points, and do not report (or inspect) each point. Plane sweep approach does not count anything; it only simplifies 2d problem into 1d. It is order statistic tree data structure that actually counts the points: it stores the list of points between two Y-coordinates (centerY-R and centerY+R) and allows to determine two positions in this list that correspond to positions of two X-coordinates (centerX-R and centerX+R). Difference between these positions is equal to the number of points in the square. – Evgeny Kluev Jun 14 '17 at 12:16
  • Right, fantastic! I'll have a little research and clarify this answer with myself then award you the bounty. – Nick Bull Jun 14 '17 at 12:25
1

I would start by creating something like a k-d tree (https://en.wikipedia.org/wiki/K-d_tree), where you have a tree with points at the leaves and each node holding information about its descendants. At each node I would keep a count of the number of descendants and a bounding box enclosing those descendants.

Now for each point I would recursively search the tree. At each node I visit, either all of the bounding box is within R of the current point, all of the bounding box is more than R away from the current point, or some of it is inside R and some outside. In the first case I can use the descendant count of the current node to increase the count of points within R of the current point, and return up one level of the recursion. In the second case I can simply return up one level without incrementing anything. Only in the intermediate case do I need to continue recursing down the tree.

So I can work out for each point the number of neighbours within R without checking every other point, and pick the point with the highest count.
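
Here is a rough sketch of that recursion (Python, with small leaf buckets; treat it as an illustration rather than tuned code):

import math

class KdNode:
    # Each node stores its descendant count and bounding box.
    def __init__(self, pts, depth=0):
        xs = [p[0] for p in pts]; ys = [p[1] for p in pts]
        self.bbox = (min(xs), min(ys), max(xs), max(ys))
        self.count = len(pts)
        if len(pts) <= 8:                    # leaf bucket
            self.pts, self.left, self.right = pts, None, None
            return
        pts = sorted(pts, key=lambda p: p[depth % 2])
        mid = len(pts) // 2
        self.pts = None
        self.left = KdNode(pts[:mid], depth + 1)
        self.right = KdNode(pts[mid:], depth + 1)

def count_within(node, cx, cy, R):
    x0, y0, x1, y1 = node.bbox
    # Distance from (cx, cy) to the farthest corner of the bounding box:
    far = math.hypot(max(abs(cx - x0), abs(cx - x1)),
                     max(abs(cy - y0), abs(cy - y1)))
    if far <= R:
        return node.count                    # box fully inside: use the count
    # Distance from (cx, cy) to the nearest point of the bounding box:
    near = math.hypot(max(x0 - cx, 0, cx - x1), max(y0 - cy, 0, cy - y1))
    if near > R:
        return 0                             # box fully outside: prune
    if node.pts is not None:                 # mixed leaf: check points one by one
        return sum(math.hypot(px - cx, py - cy) <= R for px, py in node.pts)
    return (count_within(node.left, cx, cy, R)
            + count_within(node.right, cx, cy, R))

(Note that count_within includes the query point itself if it is in the tree.)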

If the points are spread out evenly then I think you will end up constructing a k-d tree whose lower levels are close to a regular grid. If the grid is of size A x A, then in the worst case R is large enough that its boundary is a circle intersecting O(A) low-level cells, so for O(n) points I would expect this to cost about O(n * sqrt(n)).

mcdowella
  • How does this find the point with the most surrounding points in the case of many, many points being grouped within the range of R? If there were a million points within a circle of radius 1/2R, then every point within that circle would have a million points surrounding it within radius R. A kd-tree has a O(n^(1−1/d)+k) (or O(1e6^(1/2)+1e6) = O(1,001,000)) for each of those million range searches. That's still ~1e12 operations - or am I completely missing something? – Nick Bull Jun 04 '17 at 23:05
  • I think you are missing the fact that most of the 1,000,000 points that would be reported by a standard range search lie within sub-trees of the k-d tree so that we don't have to visit them one by one. As soon as we reach the top of a sub-tree where all points are in the range circle (as we can tell from its bounding box) we just increment the count of neighbours by the number of descendants held at the top of the sub-tree. I hope only sqrt(1,000,000) = 1,000 of the points in the circle will be so close to the boundary that we will have to visit them individually. – mcdowella Jun 05 '17 at 04:16
  • ah I understand! That does make a lot of sense. Going to reconsider what I can bear to sacrifice from a perfect answer to the problem for my use case, then return to this in a couple hours. Your answer has been very helpful! I don't think this answers the question though in regards to the worst-case complexity, although I'll need to think about it. – Nick Bull Jun 05 '17 at 09:49
  • If you are interested in approximations, you don't need to pursue all the mixed subtrees until they are entirely inside or outside your circle. Using the counts of descendants to weight the probability of choosing random children, you can sample at random from the leaves below a given node and use this to estimate the proportion of its descendants that are inside a circle. – mcdowella Jun 05 '17 at 17:11
0

You can speed up whatever algorithm you use by preprocessing your data in O(n) time to estimate the number of neighbouring points.

For a circle of radius R, create a grid whose cells have dimension R in both the x- and y-directions. For each point, determine to which cell it belongs. For a given cell c this test is easy:

c.x<=p.x && p.x<=c.x+R && c.y<=p.y && p.y<=c.y+R

(You may want to think deeply about whether a closed or half-open interval is correct.)

If you have relatively dense/homogeneous coverage, then you can use an array to store the values. If coverage is sparse/heterogeneous, you may wish to use a hashmap.

Now, consider a point on the grid. The extremal locations of a point within a cell are as indicated:

[Image: grid with cells of dimension R; the extremal locations of a point within a cell are indicated]

Points at the corners of the cell can only be neighbours with points in four cells. Points along an edge can be neighbours with points in six cells. Points not on an edge are neighbours with points in 7-9 cells. Since it's rare for a point to fall exactly on a corner or edge, we assume that any point in the focal cell is neighbours with the points in all 8 surrounding cells.

So, if a point p is in a cell (x,y), N[p] identifies the number of neighbours of p within radius R, and Np[y][x] denotes the number of points in cell (x,y), then N[p] is given by:

N[p] = Np[y][x]+
       Np[y][x-1]+
       Np[y-1][x-1]+
       Np[y-1][x]+
       Np[y-1][x+1]+
       Np[y][x+1]+
       Np[y+1][x+1]+
       Np[y+1][x]+
       Np[y+1][x-1]

Once we have the number of neighbours estimated for each point, we can heapify that data structure into a maxheap in O(n) time (with, e.g. make_heap). The structure is now a priority-queue and we can pull points off in O(log n) time per query ordered by their estimated number of neighbours.

Do this for the first point and use an O(log n + k) circle search (or some more clever algorithm) to determine the actual number of neighbours the point has. Make a note of this point in a variable best_found and update its N[p] value.

Peek at the top of the heap. If the estimated number of neighbours is less than N[best_found] then we are done. Otherwise, repeat the above operation.
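
Here is a rough sketch of the whole estimate-then-refine loop (Python; exact_count is a hypothetical stand-in for whatever exact circle search you use):

import heapq
from collections import defaultdict
from math import floor

def densest_point_grid(points, R, exact_count):
    cells = defaultdict(int)
    for x, y in points:                      # O(n) preprocessing
        cells[(floor(x / R), floor(y / R))] += 1
    def estimate(p):                         # 3x3 cell block: an upper bound on N[p]
        cx, cy = floor(p[0] / R), floor(p[1] / R)
        return sum(cells.get((cx + dx, cy + dy), 0)
                   for dx in (-1, 0, 1) for dy in (-1, 0, 1))
    heap = [(-estimate(p), i) for i, p in enumerate(points)]
    heapq.heapify(heap)                      # O(n); negation makes it a max-heap
    best, best_n = None, -1
    while heap:
        neg_est, i = heapq.heappop(heap)
        if -neg_est <= best_n:               # no remaining estimate can win: done
            break
        n = exact_count(points[i], R)        # the expensive exact check
        if n > best_n:
            best, best_n = points[i], n
    return best, best_n

Since the 3x3 block of cells always contains the whole circle of radius R around p, the estimate is a true upper bound, so stopping once it cannot beat the best exact count found so far is safe.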

To improve estimates you could use a finer grid, like so:

[Image: finer grid with cells of dimension R/2]

along with some clever sliding window techniques to reduce the amount of processing required (see, for instance, this answer for rectangular cases - for circular windows you should probably use a collection of FIFO queues). To increase security you can randomize the origin of the grid.

Considering again the example you posed:

[Image: tricky neighbour-counting example from the question with the grid overlaid]

It's clear that this heuristic has the potential to save considerable time: with the above grid, only a single expensive check would need to be performed in order to prove that the middle point has the most neighbours. Again, a higher-resolution grid will improve the estimates and decrease the number of expensive checks which need to be made.

You could, and should, use a similar bounding technique in conjunction with mcdowella's answer; however, that answer does not provide a good place to start looking, so it is possible to spend a lot of time exploring low-value points.

Richard