Count points in a rectangle

Question

I have lots (billions) of points in 2D which I can preprocess and I would like to answer queries which are of the following form:

Given all four corners of a rectangle, output the number of points inside the rectangle.

The rectangle can be at any orientation (meaning the axis of the rectangle can be oriented at any angle, not just horizontally or vertically).

Is there a fast practical algorithm for this?

Update. Is there any data structure to store the points which allows queries provably to be in sub-linear time?

Update II Seems the answer is a firm no https://cstheory.stackexchange.com/questions/18293/can-we-perform-an-n-d-range-search-over-an-arbitrary-box-without-resorting-to-si. Accepting the most popular answer in any case.

@Aravind That the axis of the rectangle can be oriented at any angle, not just horiz or vert. — ondrejdee, Jul 03 '13 at 13:06
how would you like to describe the four corners of your rectangle? You have 5 degrees of freedom, not 8, as would be the coordinates of four points — Walter Tross, Jul 03 '13 at 13:08
Naive approach - repeat for all points http://stackoverflow.com/questions/2752725/finding-whether-a-point-lies-inside-a-rectangle-or-not — rr-, Jul 03 '13 at 13:10
Well. Difficult problem. If the axis were aligned with x, y then I would recommend (http://en.wikipedia.org/wiki/Summed_area_table) because it is exactly what you need (you are interested just in number of points). It they are not aligned, you could have this at a number of orientations, but then the result could be only approximate. Would that be sufficient? — ondrejdee, Jul 03 '13 at 13:14
@BrunoReis An exact answer is best but if there is a good approximation that could be interesting too. — , Jul 03 '13 at 13:29
@felix, and what about the distribution of the points? Can any assumptions be made? Where do the points come from? — Bruno Reis, Jul 03 '13 at 13:42
@BrunoReis I would like not to make any assumptions about the distribution. — , Jul 03 '13 at 13:43
@felix ok, one more question: do you know beforehand the queries you need answered? Or, can any more assumptions be made on the queries? (I'm trying to see if there's a way to guide the preprocessing based on the queries) — Bruno Reis, Jul 03 '13 at 13:48
@ondav You could have a few of these tables for a few different orientations, then - it doesn't give the optimal solution if **any** angle orientation is allowed (but does with a moderate amount of allowed orientations), it can use quite a bit of space, but it should be guaranteed to be fast. — Bernhard Barker, Jul 03 '13 at 16:06
Check: http://math.stackexchange.com/questions/190111/how-to-check-if-a-point-is-inside-a-rectangle — Khaled.K, Jul 04 '13 at 07:54
@KhaledAKhunaifer I can't afford to go check every point each time I have a new query. — , Jul 04 '13 at 12:02

Bernhard Barker · Accepted Answer · 2018-02-10T15:26:10.567

17

Represent the points as a k-d tree.

That is, a binary tree in which every node represents one of the points and every non-leaf node can be thought of as splitting the current area either vertically or horizontally (alternatingly on each level) on that node's x or y value.

Then, to do a query:

Current node = the root
Current area = the area of the current node (can be kept track of / calculated as you recurse down the tree)
If current area is completely contained inside the rectangle, add the number of points this node has as children (+1 for the current node).
If current area is completely outside inside the rectangle, do nothing.
If current area is partially contained inside rectangle:
- Add the current node's point if it is contained in the rectangle.
- Repeat from 2. for both children.

Calculating whether an area or point is contained in a rectangle should be simple enough.

Each query should take O(log n) time on average on random data.

Although there would exist pathological cases where it would take O(n) time (i.e. where you'll need to explore the entire / most of the tree). One possible example of such a case is when most of the points are around the edges of a (non-axis-aligned) rectangle (either inside or outside), meaning the "do nothing" part mentioned above will rarely apply.

edited Feb 10 '18 at 15:26

answered Jul 03 '13 at 13:17

Bernhard Barker

54,589
14
104
138

@Dukeling Thank you. Do you know what the complexity of this method is? – Jul 03 '13 at 14:22
@felix Best case O(1), worst case O(n), though the worst case should be [pathological](http://en.wikipedia.org/wiki/Pathological_(mathematics)#Computer_science). – Bernhard Barker Jul 03 '13 at 14:25
@Dukeling What's an example worst case? – Jul 03 '13 at 14:25
@felix I think most of the points being around the edges of a (non-axis-aligned) rectangle (either inside or outside). This could require you to explore the entire / most of the tree. – Bernhard Barker Jul 03 '13 at 14:39
1

It should be stated that the average case (for randomly distributed points) is O(log n). This is the reason Dukeling proposed the approach. – Teepeemm Jul 03 '13 at 14:53
@Teepeemm Can that be turned into a randomized algorithm with that running time (for worst case inputs)? Or is that too optimistic? – Jul 03 '13 at 15:04
@felix The only thing you can really randomize here is what the tree looks like, but, for any possible such tree, there will be a O(n) worst case, though different trees could have different inputs which take O(n). So, no, you can't avoid the possibility of an O(n) worst case with this approach. Another algorithm which suffers from bad worst-case performance is [quick-sort](http://en.wikipedia.org/wiki/Quick_sort), but getting to that running time is so improbable that quick-sort is widely used in practice. – Bernhard Barker Jul 03 '13 at 15:20
@Dukeling Ah we may have a different idea about average case running times of randomized algorithms. Quicksort with randomly chosen pivots is O(n log n) time on average *irrespective* of the input distribution. If something similar is possible here that would be great. – Jul 03 '13 at 15:40
@felix You may simply be able to shuffle the points before creating the tree / modifying the construction to use a random point or a point close to the median instead of the median. You could likely write an entire paper on trying to avoid the worst-case. Though it is possible that different trees may have a worst-case caused by the same input rectangle, so it may not help. – Bernhard Barker Jul 03 '13 at 15:56
+1 And if you don't want to reinvent the wheels at all and not even solve the problem of how to persist the structure, use a DBMS. Most of them (Postgres, SQL-Server, Oracle, MySQL) support spatial indexes and queries though kd-trees or R-trees. – ypercubeᵀᴹ Jul 04 '13 at 08:26
@Dukeling http://cstheory.stackexchange.com/questions/18293/can-we-perform-an-n-d-range-search-over-an-arbitrary-box-without-resorting-to-si seems very negative about my question. – Jul 09 '13 at 17:09

score 10 · Answer 2 · edited May 23 '17 at 12:02

Old answer (if you cannot preprocess the points in advance):

Inscribe you rectangle in a containing rectangle with sides/edges oriented as the xy axis
Quickly exclude all points outside of it
Use the principle explained here: How to determine if a point is in a 2D triangle? with the four sides/edges of the rectangle (Note: since you are always using the same rectanglee to check all the points, you can pre-compute some of the values)

You can maybe gain something (not much, it depends on the orientation of the rectangle) in performance by quickly including points that stays in the inscribed rectangle with sides/edges oriented as the xy axis. This requires some pre-computation, but is negligible given the fact that you have a lot of points.

New answer:

rect is our input rectangle
Assume you have f1(point,rect) which checks if a point is inside a rectangle. You can use the one I mentioned above.
Assume you have f2(shape,rect), which is able to say if shape is completely contained in rect, or rect is completely contained in shape, or that shape and rect intersect or do not intersect at all
shape will be a polygon with a certain number of sides (not high or proportional to n, so that we can assume that f2 is O(1)), or a region in the 2D plane delimited by 2 sides and extending to infinite (e.g. the region delimited by the positive section of the xy axis)
Assume you have much time to preprocess the points, but not infinite. Let's say we can afford a O(n*log(n) ) algorithm

What we want to obtain is an algorithm that at runtime calls f1 and f2 a minimum number of time. For example, something proportional to (of the same order of) log(n)

So we want to divide our 2D plane in m shapes, each one containing p points. At runtime, we check each of the shapes with f2, and we can have 4 cases:

The rectangle is completely contained in the shape: using f1 I count all the points contained in this shape that lay in the rectangle (O(p) ), and I end.
The shape is completely contained in the rectangle: I add to my accumulator the whole number of points contained in the shape. (O(1) )
The rectangle and the shape do not intersect: I skip this shape.
The rectangle and the shape intersect: using f1 I count all the points contained in this shape that lay in the rectangle (O(p) ), and I continue.

We can be lucky and drop quickly in case 1, but normally we will have to check all shapes, and for at least one of them we will have to check all the points contained. So this algorithm would be O(p) + O(m). Considering that p * m = n, if we chose p = m = sqrt(n) we obtain O(sqrt(n) ) which is the best we can obtain with this method. (Note: how many times do we execute f1? This number in fact depends on the shape of rectangle, so if for example the rectangle is very much elongated it will intersect with many regions, causing many calls to f1. However, I think we can assume that the measures of the rectangle are not in the same order of n, or sqrt(n) or even log(n): n is huge.)

We could enhance from here; we could for example say that we have adjacencies among the shapes, and the first time I find an overlapping between a shape and the rectangle, I only check contiguous shapes. However, the average number of shapes we will have to check will be anyway around p/2, and O(p/2) = (O(p) ). So no effective gain.

The real gain is if we put some hierarchy in the shapes.

First of all, I check all my points, and find my bound values max_x, max_y, min_x, min_y. (Let's assume these boundaries are > > n. If we could have priors about the point distribution, the optimizations we could target would be completely different) We divide our space in shapes each one containing (around) log(n) points. We start by dividing the 2D plane in 4 shapes, using the xy axis (I could also center according to my bound values). This will be our first level of our upside-down pyramid. Cyclically: For each of the region that contains more than log(n) points, we split the region in half using a vertical or horizontal line (we alternate). If one boundary was set to infinite, to split in half I use the corresponding bound value. Each of the regions that was split contains a pointer to the regions in which it is split. The new regions are the second level of the pyramid. I keep dividing until all of my regions contains (about) log(n) points. When a region is split, it contains pointer to the "children" regions. I have built my pyramid. The top level of the upside-down pyramid contains n/log(n) shapes, which is pretty big, but it doesn't matter: what counts is that we have log(n) pyramid levels. Note: For each shape at each level we know how many points it contains. Note2: this pre-elaboration analyze in average each point one time per pyramid level, so its complexity is O(n*(log(n) ).

When I get my rectangle in input, I use f2 to check the shapes at my first level.

The rectangle is completely contained in the shape: I enter the children shapes of this region, if any, otherwise I use f1 to count the points inside the rectangle (O(log(n))) I discard any other shape.
The shape is completely contained in the rectangle: I add to my accumulator the whole number of points contained in the shape. Takes O(1)
The rectangle and the shape do not intersect: I skip this shape.
The rectangle and the shape intersect: I enter the children shapes of this region, if any, otherwise I use f1 to count the points inside the rectangle (O(log(n) ).

Now the difficult part: how many shapes do we visit? Again, this is dependent on the shape of the rectangle, how many shapes it touches. However, for each level we will visit a number of shapes not depending on n, so the number of shapes visited is proportional to O(log(n) ).

Since n is very big we can assume that the number of shapes intersected by the rectangle sides (which cause an expensive call to f1) are far less than O(log(n) ). The whole algorithm should be O(log(n) ).

There are further way to optimize, but anything will stay in average O(log(n) )

Final note: The way we divide the space has to be so that the number of sides the polygon have is controlled, because if shapes could have a big number of sides, somehow dependent on the number of points (according to a function that we call g), f2 would be O(g(n) ), and its complexity would have to be multiplied again by something depending on n, the number of shapes we have to check, so probably not good.

Or more simply. Exclude all the points outside of the inscribed rectangle and just check the remaining ones to see if they are in the rectangle. The problem with this is that it is linear time in the worst case and just generally very slow if you happen to have a lot of points in your rectangle. I only want to count them, not list them all. — , Jul 03 '13 at 13:38
@felix I do not see any difference if the problem is listing or counting the points — Antonio, Jul 03 '13 at 13:42
Potentially you could have a constant time solution for counting where listing has to be at least linear in the number of point in the rectangle. — , Jul 03 '13 at 13:47
You can just adapt the triangle algorithm to work with rectangles. You don't need to split the rectangle into two triangles. But you still have the OP's problem of the algorithm being linear in time. — Teepeemm, Jul 03 '13 at 14:11
@Teepeemm In which way the proposed method is not linear with the number of points? — Antonio, Jul 03 '13 at 14:30
`"Quickly exclude all points outside of it"` - how? Go through all the points and check? (which is consistently linear, thus not so quick) — Bernhard Barker, Jul 03 '13 at 14:42
@Dukeling If you can check each point with less than 4 comparisons in a general way, congratulations. I will study your answer tonight. — Antonio, Jul 03 '13 at 14:46
The algorithm is checking that each point lies on the correct half of each side of the triangle (or rectangle, in this case). But it has to go through each point, hence linear. — Teepeemm, Jul 03 '13 at 14:56
@Teepeemm Yeah, but in my understanding linear is good. I mean, it cannot be otherwise: you have to do at least 1 operation per point. You can try to reduce the number of operations per point, but it will always be linear (at least). — Antonio, Jul 03 '13 at 15:08
@Antonio That's not right. For example if the rectangle were axis parallel you can certainly count the number of points in it in much less than linear time. The key idea is to preprocess the points into some data structure before the queries arrive. The preprocessing will take linear time of course but that is not counted in the query time. — , Jul 03 '13 at 15:19
@felix: So, what do you know in advance, the points or the rectangle? I thought the points where part of the input, not pre-defined, pre-elaborated and then tested against many rectangles. Ok, now I see I had missed that part of the question :P I will edit my answer tonight... — Antonio, Jul 03 '13 at 15:25
You know the points in advance. You get to preprocess them. Then the queries (rectangles) arrive and you want the time to answer each query to be fast. — , Jul 03 '13 at 15:42
@felix See updated answer, I hope it doesn't contain too many mistakes. The demonstration about the final complexity would need some work, I am open to suggestions. — Antonio, Jul 03 '13 at 22:08
It would be great if it were O(log n) in the worst case! I find that hard to verify from your description however. Can anyone else verify this? — , Jul 04 '13 at 08:01
@felix It's `O(log n)` also in the worst case. What matters is how many shapes are crossed by the edges of the rectangle (because they trigger more computation), so the computation time will be somehow proportional to the perimeter of the rectangle. But even the worst rectangle will be able to have edges that cross a number of region proportional to `log(n)` (when `n` is big). This from intuitive considerations; mathematically it might be complicated to demonstrate. — Antonio, Jul 04 '13 at 11:34
It seems it can't be O(log n) time http://cstheory.stackexchange.com/questions/18293/can-we-perform-an-n-d-range-search-over-an-arbitrary-box-without-resorting-to-si . — , Jul 09 '13 at 17:10
@felix I do not see an impact, for the problem you proposed, if the rectangle is rotated (it will increase the number of checks, but not esponentially), which is the point of the link you just posted. I do not think the problem shown at the link is the same we have here; one assumption I made is that the ratio between the rectangle perimeter and the area where the points are spread is very low. — Antonio, Jul 09 '13 at 20:06

score 3 · Answer 3 · edited May 23 '17 at 11:54

What you need is some sort of binary space partitioning data structure. That'll get you a list of candidates for which you could do the real "point in polygon" test.

I'd advise you to make sure that this is something you really ought to be coding on your own. For example, many DBs have this sort of functionality built in. Does your data actually reside in a DB? Could it? (no sense in reinventing wheels...)

You can see a great answer for the Point in Polygon problem here: How can I determine whether a 2D Point is within a Polygon?

score 3 · Answer 4 · answered Jul 03 '13 at 13:10

3

I'd suggest finding a rotation+shift transformation that you can apply to your space, so that one corner of the rectangle is in (0,0) and two edges go along the x and y axes.

Now you go through the points, apply the same transformation and just check for 0 < x < rectangle_max_x and 0 < y < rectangle_max_y.

answered Jul 03 '13 at 13:10

viraptor

33,322
10
107
191

This is linear complexity. Probably too slow. – ondrejdee Jul 03 '13 at 13:12
Maybe too slow - you can always go the way of tree building, But (US) billions of points is not that much to go through. Depends on how many points per second is needed. – viraptor Jul 03 '13 at 13:14
I agree both with @ondav and with viraptor :-) – Walter Tross Jul 03 '13 at 13:15
@viraptor how do you transform a tree? – ondrejdee Jul 03 '13 at 13:16
@ondav I meant tree instead of transformation, not together :) Basically, you can do instead what Assaf suggested in another answer. – viraptor Jul 03 '13 at 13:18
I am looking for something that is very much sub-linear time per query. I plan on doing lots of queries. – Jul 03 '13 at 13:31
Ok, in that case, definitely do not use my answer ;) Dukeling proposed a pretty much standard approach for doing things like that in the 3d rendering world which should give you sub-linear time. – viraptor Jul 03 '13 at 13:34

score 1 · Answer 5 · answered Jul 04 '13 at 08:38

1

Make triangle. Suppose, abcd is the rectangle and x is the point then if area(abx)+area(bcx)+area(cdx)+area(dax) equals area(abcd) then the point is inside it.

answered Jul 04 '13 at 08:38

pcbabu

2,219
4
22
32

score 0 · Answer 6 · answered Jul 03 '13 at 13:23

0

You can use a BoundBox for the rectangle then if a point is inside the boundbox you can check if it collide with the rectangle or you can use oriented bounded box.

This is the simplest way and don't need use of complex data structure

answered Jul 03 '13 at 13:23

Elvis Dukaj

7,142
12
43
85

TimFoolery · Answer 7 · 2013-07-03T14:25:45.957

If speed is an issue but memory/diskspace is not, consider doing the following, which should be the most efficient methods possible.

This way you can perform some very fast tests prior to ever doing any significant math:

public class DataPoint
{
    double X, Y;
    ...
}
public bool IsInBoundingBox(Point p1, Point p2, Point p3, Point p4)
{
    // assume p1, p2, p3, p4 to be sorted
    return (X>p1.X && X<p3.X && Y>p4.Y && Y<p2.Y);
}

Then the order of actually doing the work should be this...

// sort points of QueryRectangle: p1 is left-most, p2 is top-most, p3 is right-
// most, and p4 to be bottom-most; if there is a tie for left-most, p1 should
// be the bottom-left corner, p2 the top-left corner, p3 the top-right corner,
// and p4 the bottom-right corner

// See if the QueryRectangle in question is aligned with the grid; if it is,
// then the set of DataPoints that lie within the BoundingBox are within the
// QueryRectangle and no further calculation is needed

if (p1.X == p2.X || p1.X == p3.X || p1.X == p4.X) 
{
    // is orthogonal (aligned with axes)
    foreach(DataPoint dp in listDataPoints)
        if(dp.IsInBoundingBox())
        {
            // dp is in QueryRectangle; perform work
        }
}
else
{   
    foreach(DataPoint dp in listDataPoints)
        if(dp.IsInBoundingBox())
        {
            // perform further testing to see if dp is in QueryRectangle
        }
}

Alternatively, if you want to go with a rotation/translation solution as viraptor suggests...

// sort points of QueryRectangle: p1 is left-most, p2 is top-most, p3 is right-
// most, and p4 to be bottom-most; if there is a tie for left-most, p1 should
// be the bottom-left corner, p2 the top-left corner, p3 the top-right corner,
// and p4 the bottom-right corner

public class DataPointList : List<DataPoint>
{
    public List<DataPoint> GetPointsInRectangle(Point p1, Point p2, Point p3, Point p4)
    {
        List<DataPoint> tempListDataPoints = new List<DataPoint>();
        foreach(DataPoint dp in this)
            if(dp.IsInBoundingBox()) tempListDataPoints.Add(dp);

        if (!(p1.X == p2.X || p1.X == p3.X || p1.X == p4.X))
        {
            // needs transformation
            tempListDataPoints.TranslateAll(-1 * p1.X, -1 * p1.Y);
            tempListDataPoints.RotateAll(/* someAngle derived from the arctan of p1 and p2 */);
            // Note: you should be rotating counter-clockwise by some angle >0 but <90

            // the new p1 will be 0,0, but p2, p3, and p4 all need to undergo the same transformations
            // transP1 = new Point(0,0);
            // transP2 = new Point(p2.Translate(-1 * p1.X, -1 * p1.Y).Rotate(/* someAngle derived from the arctan of p1 and p2 */));
            // transP3 = ...; transP4 = ...;

            foreach(DataPoint dp in tempListDataPoints)
                if (!(dp.X>transP1.X && dp.X<transP3.X && dp.Y>transP1.Y && dp.Y<transP3.Y)) tempListDataPoints.Remove(dp);
        }
        else
        {
            // QueryRectangle is aligned with axes, all points in bounding box
            // lie within the QueryRectangle, no need for transformation or any
            // further calculation

            // no code needs to go here, but you may want to keep it around for notes
        }

        return tempListDataPoints;
    }
}

Alternatively, you could do the above code with an array. I'll leave figuring that out up to you...

Disclaimer: I got 2 hours of sleep last night, so I'm not going to proofread. You may need to do some minor fixes. Or major ones. Who knows. :)

@Teepeemm O(n)? Yegh... big O notation is sorta overrated IMO. The only way that I can think of to make it even quicker is to load the points into sectors based on the full range of the data points. Now that I've read the more current answers of others, this seems to be pretty much what Dukeling is proposing. Regardless of k-d tree vs O(n) bounding box comparisons (in fact, they are not mutually exclusive, as you can do a bounding box comparison on data points retrieved from the k-d tree), there are several optimizations to be done, and some of them are featured in my code, not just (cont...) — TimFoolery, Jul 03 '13 at 15:35
bounding box: that an orthogonal QueryRectangle should not be translated and then rotated, for example... translating is a waste of time if there is no rotation. — TimFoolery, Jul 03 '13 at 15:37

score 0 · Answer 8 · answered Jul 04 '13 at 06:41

0

A simplification you can make to solve your problem is to find the minimum axis-aligned rectangle (S) containing the one given (R). Use some spatial tree structure as a k-d tree to find the subset of points contained inside S and finally, select for that subset the points that are inside R.

That approach will be way easier to implement than the one proposed by @Dukelin where the k-d tree search is performed directly using R.

answered Jul 04 '13 at 06:41

salva

9,943
4
29
57

Thank you. The problem is that R might be contain a large proportion of all the points. – Jul 04 '13 at 07:59
@felix, that's an unavoidable issue as any algorithm will be at least `O(|set_of_points_contained_in(R)|)`. The real problem for the approach I have proposed is when `|set_of_points_contained_in(S)|` is much bigger than `|set_of_points_contained_in(R)|`. – salva Jul 04 '13 at 08:19
I am not sure your claim is true. If the query rectangle were axis parallel then you could solve the problem much more quickly than O(|set_of_points_contained_in(R)|) by suitable preprocessing of the points. – Jul 04 '13 at 12:04
@felix: just returning the result is O(|set_of_points_contained_in(R)|). – salva Jul 04 '13 at 12:57

akalenuk · Answer 9 · 2013-07-04T08:41:11.023

I would start with presorting the points array along any axis (let it be x). Then binary searching it for the points with x-projection in rectangles' bounding box x-projection. That along would of reduce the numbers of points to check drastically.

Then we can filter the points further just by checking if they are in rectangles bounding box. But yes, that would be linear.

Then we may take a transformation matrix for the rectangle (I assume we already have it). Rectangle is an affine transformation of singular 2-cube, therefore we can find reverse transformation without calculating an actual inverse matrix.

For direct transform matrix of

A D a
B E b
C F c

the solution would be:

d = 1/(AE − BD)

A' = Ed
B' = −Bd
C' = (BF − EC)d

D' = −Dd
E' = Ad
F' = (DC − AF)d

a' = b' = 0
c' = 1

Then, by applying inverse transform to every point, we will either translate it into a singular cube, which is (0, 1)x(0, 1), if it originally lies in a rectangle, or not if it doesn't.

UPD: Or, instead of all the transformation stuff, you can do the following:

Let points of rectangle be P1..P4 and the point to check A.

For i = 1..4 calculate PAi as Pi - A

Now the cross product of (Pi.x, Pi.y, 0)x(Pj.x, Pj.y, 0) would measure the triangle made by an A and corresponding rectangles' edge. And, as original point are all on xy plane, the result would be like (0, 0, Sij), where Sij is the signed square of a triangle. Just calculate the sum:

|(P1.x, P1.y, 0)x(P2.x, P2.y, 0)[3]|+
|(P2.x, P2.y, 0)x(P3.x, P3.y, 0)[3]|+
|(P3.x, P3.y, 0)x(P4.x, P4.y, 0)[3]|+
|(P4.x, P4.y, 0)x(P1.x, P1.y, 0)[3]|

And compare it with the rectangles square. If it is more or less equal, then the point is in the rectangle. There would be some small computational error, so the exact equality is out of the question.

Count points in a rectangle

9 Answers9

Linked