Old answer (if you cannot preprocess the points in advance):
- Inscribe you rectangle in a containing rectangle with sides/edges oriented as the xy axis
- Quickly exclude all points outside of it
- Use the principle explained here: How to determine if a point is in a 2D triangle? with the four sides/edges of the rectangle (Note: since you are always using the same rectanglee to check all the points, you can pre-compute some of the values)
You can maybe gain something (not much, it depends on the orientation of the rectangle) in performance by quickly including points that stays in the inscribed rectangle with sides/edges oriented as the xy axis. This requires some pre-computation, but is negligible given the fact that you have a lot of points.
New answer:
rect
is our input rectangle
- Assume you have
f1(point,rect)
which checks if a point is inside a rectangle. You can use the one I mentioned above.
- Assume you have
f2(shape,rect)
, which is able to say if shape is completely contained in rect, or rect is completely contained in shape, or that shape and rect intersect or do not intersect at all
- shape will be a polygon with a certain number of sides (not high or proportional to n, so that we can assume that
f2
is O(1)
), or a region in the 2D plane delimited by 2 sides and extending to infinite (e.g. the region delimited by the positive section of the xy axis)
- Assume you have much time to preprocess the points, but not infinite. Let's say we can afford a O(n*log(n) ) algorithm
What we want to obtain is an algorithm that at runtime calls f1 and f2 a minimum number of time. For example, something proportional to (of the same order of) log(n)
So we want to divide our 2D plane in m
shapes, each one containing p
points. At runtime, we check each of the shapes with f2, and we can have 4 cases:
- The rectangle is completely contained in the shape: using f1 I count
all the points contained in this shape that lay in the rectangle
(O(p) )
, and I end.
- The shape is completely contained in the
rectangle: I add to my accumulator the whole number of points
contained in the shape.
(O(1) )
- The rectangle and the shape do not
intersect: I skip this shape.
- The rectangle and the shape intersect:
using f1 I count all the points contained in this shape that lay in
the rectangle
(O(p) )
, and I continue.
We can be lucky and drop quickly in case 1, but normally we will have to check all shapes, and for at least one of them we will have to check all the points contained. So this algorithm would be O(p) + O(m)
. Considering that p * m = n
, if we chose p = m = sqrt(n)
we obtain O(sqrt(n) )
which is the best we can obtain with this method. (Note: how many times do we execute f1? This number in fact depends on the shape of rectangle, so if for example the rectangle is very much elongated it will intersect with many regions, causing many calls to f1. However, I think we can assume that the measures of the rectangle are not in the same order of n
, or sqrt(n)
or even log(n)
: n
is huge.)
We could enhance from here; we could for example say that we have adjacencies among the shapes, and the first time I find an overlapping between a shape and the rectangle, I only check contiguous shapes. However, the average number of shapes we will have to check will be anyway around p/2, and O(p/2) = (O(p) )
. So no effective gain.
The real gain is if we put some hierarchy in the shapes.
First of all, I check all my points, and find my bound values max_x, max_y, min_x, min_y. (Let's assume these boundaries are > > n. If we could have priors about the point distribution, the optimizations we could target would be completely different)
We divide our space in shapes each one containing (around) log(n) points. We start by dividing the 2D plane in 4 shapes, using the xy axis (I could also center according to my bound values). This will be our first level of our upside-down pyramid.
Cyclically: For each of the region that contains more than log(n) points, we split the region in half using a vertical or horizontal line (we alternate). If one boundary was set to infinite, to split in half I use the corresponding bound value. Each of the regions that was split contains a pointer to the regions in which it is split. The new regions are the second level of the pyramid. I keep dividing until all of my regions contains (about) log(n) points. When a region is split, it contains pointer to the "children" regions. I have built my pyramid. The top level of the upside-down pyramid contains n/log(n) shapes, which is pretty big, but it doesn't matter: what counts is that we have log(n) pyramid levels. Note: For each shape at each level we know how many points it contains. Note2: this pre-elaboration analyze in average each point one time per pyramid level, so its complexity is O(n*(log(n) ).
When I get my rectangle in input, I use f2 to check the shapes at my first level.
- The rectangle is completely contained in the shape: I enter the children shapes of this region, if any, otherwise I use f1 to count the points inside the rectangle (
O(log(n))
) I discard any other shape.
- The shape is completely contained in the rectangle: I add to my accumulator the whole number of points contained in the shape. Takes
O(1)
- The rectangle and the shape do not intersect: I skip this shape.
- The rectangle and the shape intersect: I enter the children shapes of this region, if any, otherwise I use f1 to count the points inside the rectangle
(O(log(n) )
.
Now the difficult part: how many shapes do we visit? Again, this is dependent on the shape of the rectangle, how many shapes it touches. However, for each level we will visit a number of shapes not depending on n
, so the number of shapes visited is proportional to O(log(n) )
.
Since n is very big we can assume that the number of shapes intersected by the rectangle sides (which cause an expensive call to f1) are far less than O(log(n) )
. The whole algorithm should be O(log(n) )
.
There are further way to optimize, but anything will stay in average O(log(n) )
Final note: The way we divide the space has to be so that the number of sides the polygon have is controlled, because if shapes could have a big number of sides, somehow dependent on the number of points (according to a function that we call g
), f2 would be O(g(n) )
, and its complexity would have to be multiplied again by something depending on n
, the number of shapes we have to check, so probably not good.