13

I have a set of points which are contained within the rectangle. I'd like to split the rectangles into subrectangles based on point density (giving a number of subrectangles or desired density, whichever is easiest).

The partitioning doesn't have to be exact (almost any approximation better than regular grid would do), but the algorithm has to cope with the large number of points - approx. 200 millions. The desired number of subrectangles however is substantially lower (around 1000).

Does anyone know any algorithm which may help me with this particular task?

Josh Lee
  • 171,072
  • 38
  • 269
  • 275
Karol Kolenda
  • 1,660
  • 2
  • 25
  • 37
  • 1
    Do you want (approximately) the same number of points in each subrectangle ? I think you do, but it's not entirely clear. – High Performance Mark Jun 02 '10 at 16:28
  • Yes, I want the same number of points in each subrect. – Karol Kolenda Jun 02 '10 at 16:33
  • 5
    If you do not impose restrictions on the form of the rectangles, the problem is one dimensional, and all you have to do is to cut a segment with the same number of points (x coordinate of the points). It seems to me that you are forgetting some "topology" restriction – Dr. belisarius Jun 02 '10 at 17:44
  • Yes belisarius, you are right. I want within the rectangle to be spatially close to one another, so slicing the initial rectangle into vertical or horizontal stripes is not what I'm looking for. – Karol Kolenda Jun 02 '10 at 20:30
  • Do you want an uneven grid, or just a set of mutually-exclusive covering rectangles? – BlueRaja - Danny Pflughoeft Jun 02 '10 at 21:24
  • @Karol I understood the "slicing the initial rectangle into vertical or horizontal stripes is not what I'm looking for" part, but I find myself not able to translate the affirmative ("I want within the rectangle to be spatially close to one another") to a specification formally enough to try to design an algorithm. Perhaps you should try to formalize the requirement or include a sample graphic of the desired output – Dr. belisarius Jun 03 '10 at 17:38
  • Well, I don't need any specific heuristic - but the rectangles are part of the external spatial indexing engine which I need to populate with points. Therefore thin slices won't be very efficient during window querying (where a window has usually 4:3 ratio). – Karol Kolenda Jun 03 '10 at 20:20

8 Answers8

2

R-tree

Pete Kirkham
  • 48,893
  • 5
  • 92
  • 171
Doug Currie
  • 40,708
  • 1
  • 95
  • 119
2

Just to understand the problem. The following is crude and perform badly, but I want to know if the result is what you want>

Assumption> Number of rectangles is even
Assumption> Point distribution is markedly 2D (no big accumulation in one line)

Procedure>
Bisect n/2 times in either axis, looping from one end to the other of each previously determined rectangle counting "passed" points and storing the number of passed points at each iteration. Once counted, bisect the rectangle selecting by the points counted in each loop.

Is that what you want to achieve?

Dr. belisarius
  • 60,527
  • 15
  • 115
  • 190
  • Yes, exactly. That would be perfect. I was thinking about initially create a matrix (1000x1000 or so) and store number of points in each cell. Then I can use your bisecting algorithm. – Karol Kolenda Jun 02 '10 at 16:53
2

I think I'd start with the following, which is close to what @belisarius already proposed. If you have any additional requirements, such as preferring 'nearly square' rectangles to 'long and thin' ones you'll need to modify this naive approach. I'll assume, for the sake of simplicity, that the points are approximately randomly distributed.

  1. Split your initial rectangle in 2 with a line parallel to the short side of the rectangle and running exactly through the mid-point.
  2. Count the number of points in both half-rectangles. If they are equal (enough) then go to step 4. Otherwise, go to step 3.
  3. Based on the distribution of points between the half-rectangles, move the line to even things up again. So if, perchance, the first cut split the points 1/3, 2/3, move the line half-way into the heavy half of the rectangle. Go to step 2. (Be careful not to get trapped here, moving the line in ever decreasing steps first in one direction, then the other.)
  4. Now, pass each of the half-rectangles in to a recursive call to this function, at step 1.

I hope that outlines the proposal well enough. It has limitations: it will produce a number of rectangles equal to some power of 2, so adjust it if that's not good enough. I've phrased it recursively, but it's ideal for parallelisation. Each split creates two tasks, each of which splits a rectangle and creates two more tasks.

If you don't like that approach, perhaps you could start with a regular grid with some multiple (10 - 100 perhaps) of the number of rectangles you want. Count the number of points in each of these tiny rectangles. Then start gluing the tiny rectangles together until the less-tiny rectangle contains (approximately) the right number of points. Or, if it satisfies your requirements well enough, you could use this as a discretisation method and integrate it with my first approach, but only place the cutting lines along the boundaries of the tiny rectangles. This would probably be much quicker as you'd only have to count the points in each tiny rectangle once.

I haven't really thought about the running time of either of these; I have a preference for the former approach 'cos I do a fair amount of parallel programming and have oodles of processors.

High Performance Mark
  • 77,191
  • 7
  • 105
  • 161
  • Thinking again and summing up both answers, cutting thin slice rectangles is equivalent to one dimensional problem. So, if you have N points and r rectangles, just take in account the x coordinate of the points and sum up N/r points in each partition – Dr. belisarius Jun 02 '10 at 17:34
2

You're after a standard Kd-tree or binary space partitioning tree, I think. (You can look it up on Wikipedia.)

Since you have very many points, you may wish to only approximately partition the first few levels. In this case, you should take a random sample of your 200M points--maybe 200k of them--and split the full data set at the midpoint of the subsample (along whichever axis is longer). If you actually choose the points at random, the probability that you'll miss a huge cluster of points that need to be subdivided will be approximately zero.

Now you have two problems of about 100M points each. Divide each along the longer axis. Repeat until you stop taking subsamples and split along the whole data set. After ten breadth-first iterations you'll be done.

If you have a different problem--you must provide tick marks along the X and Y axis and fill in a grid along those as best you can, rather than having the irregular decomposition of a Kd-tree--take your subsample of points and find the 0/32, 1/32, ..., 32/32 percentiles along each axis. Draw your grid lines there, then fill the resulting 1024-element grid with your points.

Rex Kerr
  • 166,841
  • 26
  • 322
  • 407
  • Maybe I'm missing the point of kd-trees, but they don't seem to create rectangles with points INSIDE the rectangles as the question states. The second comment says he's looking for equal number of points within a rectangle. The reference contained many other references that I thought might be useful. – Grembo Jun 02 '10 at 19:07
  • @Grembo - It's a small implementation detail to switch to making the lines _between_ points instead of _on_ points. – Rex Kerr Jun 02 '10 at 19:22
  • KD-Tree was indeed my first reflex too, I balked at first because of the huge set but the sampling approach alleviates the issue. – Matthieu M. Jun 03 '10 at 14:19
0

Good question.

I think the area you need to investigate is "computational geometry" and the "k-partitioning" problem. There's a link that might help get you started here

You might find that the problem itself is NP-hard which means a good approximation algorithm is the best you're going to get.

Grembo
  • 1,223
  • 7
  • 6
  • The poster has asked for equipartitioning of points, not minimizing or maximizing interactions. The latter can be NP-hard (e.g. graph max cut); the former is `O(n log n)`. – Rex Kerr Jun 02 '10 at 18:45
0

Would K-means clustering or a Voronoi diagram be a good fit for the problem you are trying to solve?

fbrereto
  • 35,429
  • 19
  • 126
  • 178
0

That's looks like Cluster analysis.

ony
  • 12,457
  • 1
  • 33
  • 41
  • In general you are right - it is a cluster analysis problem. But it is quite unusual problem so I don't think that ready-to-use methods from the cluster analysis toolbox will useful. I'm only interested in rectangles - not any clusters, I have lots of object and I don't care that much about precision. – Karol Kolenda Jun 02 '10 at 20:37
0

Would a QuadTree work?

A quadtree is a tree data structure in which each internal node has exactly four children. Quadtrees are most often used to partition a two dimensional space by recursively subdividing it into four quadrants or regions. The regions may be square or rectangular, or may have arbitrary shapes. This data structure was named a quadtree by Raphael Finkel and J.L. Bentley in 1974. A similar partitioning is also known as a Q-tree. All forms of Quadtrees share some common features:

  • They decompose space into adaptable cells
  • Each cell (or bucket) has a maximum capacity. When maximum capacity is reached, the bucket splits
  • The tree directory follows the spatial decomposition of the Quadtree
David Basarab
  • 72,212
  • 42
  • 129
  • 156