Some of the steps below could be optimized in the sense that you could perform them while you build up the data set. However, I'll assume you are just given a series of points and have to find which cells they fall into. If you can inject your own code into the step that builds up the graph, you could do the work below alongside building the graph instead of after the fact.
You're stuck with brute force if you're just given the data; you have to visit each point at least once to figure out which cell it is in, so O(n) is the best you can do. If you have some other knowledge you could exploit, that's up to you to utilize, but since it wasn't mentioned in the OP I'll assume we're stuck with brute force.
The high level strategy would be as follows:
// 1) Initialize the rectangle bounds with minX/minY at +inf and maxX/maxY at
//    -inf (or initialize them from the first point)
// 2) For each point:
//        minX = min(point.x, minX); minY = min(point.y, minY)
//        and the same with max() for maxX/maxY
// 3) Now that you have your bounds, divide each axis by the cell size to get
//    the number of cells on that axis, keeping in mind that integer division
//    truncates, so either round up yourself or cast to float and ceil()
int cols = ceil(float(maxX - minX) / CELL_WIDTH);
int rows = ceil(float(maxY - minY) / CELL_HEIGHT);
// 4) You have the number of cells across the width and height, so make a 2D
//    array of some sort that is rows * cols cells (each cell holding at least
//    a 32-bit int), zero-initialized if this is C or C++
// 5) Figure out each point's cell by subtracting the bottom-left corner of the
//    bounds (the min x/y found in step (2)) and dividing by the cell size.
//    A point sitting exactly on the max edge would land one past the end, so
//    clamp it into the last row/column.
for (Point p : points) {
    int col = min(cols - 1, int((p.x - minX) / CELL_WIDTH));
    int row = min(rows - 1, int((p.y - minY) / CELL_HEIGHT));
    data[row][col]++;
}
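To make those steps concrete, here's a minimal, self-contained C++ sketch of the whole thing. Point, CELL_WIDTH/CELL_HEIGHT and countPointsPerCell are names I'm making up here; swap in whatever your graph code actually uses.

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <limits>
#include <vector>

struct Point { double x, y; };

// Hypothetical cell size -- substitute whatever your grid actually uses.
const double CELL_WIDTH  = 100.0;
const double CELL_HEIGHT = 100.0;

// Two passes over the points: one to find the bounding rectangle, one to
// bucket each point into its cell. Returns a rows x cols grid of counts.
std::vector<std::vector<uint32_t>> countPointsPerCell(const std::vector<Point>& points)
{
    if (points.empty()) return {};

    // Steps 1-2: bounds start at +/- infinity and shrink-wrap around the data.
    double minX =  std::numeric_limits<double>::infinity();
    double minY =  std::numeric_limits<double>::infinity();
    double maxX = -std::numeric_limits<double>::infinity();
    double maxY = -std::numeric_limits<double>::infinity();
    for (const Point& p : points) {
        minX = std::min(minX, p.x);  minY = std::min(minY, p.y);
        maxX = std::max(maxX, p.x);  maxY = std::max(maxY, p.y);
    }

    // Steps 3-4: size the grid, rounding up so a partial cell still counts.
    int cols = std::max(1, (int)std::ceil((maxX - minX) / CELL_WIDTH));
    int rows = std::max(1, (int)std::ceil((maxY - minY) / CELL_HEIGHT));
    std::vector<std::vector<uint32_t>> grid(rows, std::vector<uint32_t>(cols, 0));

    // Step 5: translate each point to the bottom-left corner, divide by the
    // cell size, and clamp points sitting exactly on the max edge into the
    // last row/column.
    for (const Point& p : points) {
        int col = std::min(cols - 1, (int)((p.x - minX) / CELL_WIDTH));
        int row = std::min(rows - 1, (int)((p.y - minY) / CELL_HEIGHT));
        grid[row][col]++;
    }
    return grid;
}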
Optimizations:
There are some ways we might be able to speed this up off the top of my head:
If the cell width/height are powers of two, you could replace the divisions with bit shifts. Outside of C or C++ that might possibly speed things up, but I haven't profiled it, so maybe HotSpot in Java and the like would do this for you anyway (and no idea about Python). Then again, 1 million points should be pretty fast either way.
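For example, with a cell width of 128 units (2^7) the division turns into a shift. CELL_SHIFT_X and columnOf are names I'm making up, and this only applies once the translated coordinates are non-negative integers:

#include <cstdint>

// Only valid when the cell width is exactly 1 << CELL_SHIFT_X.
const int CELL_SHIFT_X = 7;   // i.e. a cell width of 128 units

inline int columnOf(int64_t x, int64_t minX)
{
    return (int)((x - minX) >> CELL_SHIFT_X);   // same result as dividing by 128
}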
We don't need to scan the whole range up front; we could just keep resizing the table, adding new rows and columns whenever we see a value that falls outside it. That way we'd only do one iteration over the points instead of two.
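A rough sketch of that single-pass idea, assuming the row/column indices are non-negative (i.e. the coordinates are already relative to a known origin, as in the next point) so the table only ever grows toward larger indices; bumpCell is a made-up name:

#include <cstdint>
#include <vector>

// Grow the grid on demand instead of scanning for the bounds first.
// Rows are allowed to stay ragged; a missing cell just means a count of zero.
void bumpCell(std::vector<std::vector<uint32_t>>& grid, int row, int col)
{
    if ((int)grid.size() <= row)
        grid.resize(row + 1);             // add the missing rows
    if ((int)grid[row].size() <= col)
        grid[row].resize(col + 1, 0);     // widen just this row
    grid[row][col]++;
}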
If you don't care about the extra space usage and your numbers are positive only, you could skip the "translate to origin" subtraction step entirely by assuming everything is already relative to the origin. You'd do that by modifying step (1) so the min starts at 0 instead of +inf (or the first point, if you went that route). This can backfire if your points sit really far out on the axes, since you'd end up creating a ton of empty cells; you know your data and whether that's acceptable.
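In code that just means step (5) loses the subtraction (again, only if your coordinates are guaranteed non-negative):

// Hypothetical variant of step 5 with the grid origin fixed at (0, 0):
int col = p.x / CELL_WIDTH;    // no "- minX"
int row = p.y / CELL_HEIGHT;   // no "- minY"
data[row][col]++;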
There are probably a few more things that could be done, but this should get you on the right track to doing it efficiently. You'd also be able to work back from a count to which cell it belongs to.
EDIT: This assumes the cell width isn't tiny compared to the overall grid size (e.g. a cell width of 100 units, but a graph spanning 2 million units). If it is, you'd need to look into something like a sparse matrix.
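If you do end up in that situation, a hash map keyed on the (row, col) pair works as a crude sparse grid; this is only a sketch of the idea:

#include <cstdint>
#include <unordered_map>

// Crude sparse grid: only cells that actually contain points take up memory.
// The key packs the row and column into one 64-bit integer (assumes both
// indices fit in 32 bits).
using CellKey = uint64_t;

inline CellKey makeKey(uint32_t row, uint32_t col)
{
    return (static_cast<uint64_t>(row) << 32) | col;
}

std::unordered_map<CellKey, uint32_t> counts;   // cell -> number of points

// In the step-5 loop, counts[makeKey(row, col)]++ replaces data[row][col]++;
// any cell that never shows up in the map simply has a count of zero.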