Optimizing 2D grid connectivity algorithm

Question

Summary: I'm looking for an optimal algorithm to ensure connectivity over a 2D grid of binary values. I have a fairly involved algorithm that does it in effectively linear time, but only if certain pre-processing steps are performed. The following goes into fairly extensive detail about the algorithm and its run time. I've also put together a Unity app that offers a detailed visualization of all the steps mentioned below (and some others), which can be found here.

I have a set of scripts that procedurally generate terrain using an algorithm called marching squares. One of the steps is to connect all the regions together. Specifically, I have a grid of 0s (floors) and 1s (walls) and want to ensure that every 0 is reachable from every other 0. I'm optimizing for:

The amount of tunneling that needs to be done, i.e. the number of 1s turned into 0s should be minimized.
Asymptotic run-time. I'm trying to make it linear in the number of tiles in the grid, or as close to linear as possible.

By treating the rooms (connected regions of 0s) as vertices and potential tunnels as edges, we can use a minimum spanning tree algorithm as our workhorse. I'll describe the algorithm from the starting point of an unconnected grid of 0s and 1s.

Input:

A 2d array of bytes, either 0 or 1, representing terrain (0: floor, 1: wall).

e.g. the following has four 'rooms' (connected components of floors).

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
1 0 0 0 0 0 0 0 1 1 1 0 0 0 0 1 1 1 1 1 0 0 0 1 1 
1 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 1 1 0 0 1 1 1 1 
1 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 1 
1 1 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 1 1 1 1 1 1 
1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
1 1 1 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
1 1 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1 1 1 1 1 
1 1 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 1 1 1 1 1 1 1 
1 1 1 0 0 0 0 0 0 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1 
1 1 1 1 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Output:

The same grid, such that each room can be reached from any other room, with the least amount of damage done to the grid (fewest 1s flipped to 0s). Here we've carved a total of three tunnels:

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
1 0 0 0 0 0 0 0 1 1 1 0 0 0 0 1 1 1 1 1 0 0 0 1 1 
1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 1 1 1 1 
1 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 1 
1 1 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 1 1 1 1 1 1 
1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
1 1 1 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 
1 1 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 1 1 1 1 1 1 1 
1 1 1 0 0 0 0 0 0 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1 
1 1 1 1 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Overview of basic algorithm:

The following is a high level description of the algorithm, without several crucial optimizations, in order to illustrate the high level ideas:

1. Run a BFS on the grid to find rooms (connected components of floors), storing only edge tiles (i.e. floor tiles adjacent to a wall tile), since the shortest path between two rooms will always be between two edge tiles.
2. For each pair of rooms, do a double loop over their edge tiles to find the pair of tiles with the shortest Euclidean distance. This pair forms a potential tunnel between the two rooms, dug as a straight path between them.
3. Treat the rooms from (1) as vertices and the pairs from (2) as edges in a graph, with the weights being Euclidean distance. Run Kruskal's minimum spanning tree algorithm on this graph to acquire a list of tunnels to dig that minimizes the number of tiles that need to be changed.
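As a rough illustration of step (1), here is a minimal Python sketch of a BFS that finds rooms and keeps only their edge tiles. The function name `find_rooms`, the 4-neighbour connectivity, and the choice to treat out-of-bounds cells as walls are assumptions made for this sketch, not details taken from my actual scripts.

```python
from collections import deque

def find_rooms(grid):
    """Return one list of edge tiles (row, col) per room of floor tiles (0s).

    Assumptions for this sketch: rooms are 4-connected, and a floor tile is
    an edge tile if any 4-neighbour is a wall (1) or lies outside the grid.
    """
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    neighbours = ((1, 0), (-1, 0), (0, 1), (0, -1))
    rooms = []
    for r0 in range(rows):
        for c0 in range(cols):
            if grid[r0][c0] != 0 or seen[r0][c0]:
                continue
            # New room found: flood it with a BFS, recording edge tiles only.
            edge_tiles = []
            queue = deque([(r0, c0)])
            seen[r0][c0] = True
            while queue:
                r, c = queue.popleft()
                is_edge = False
                for dr, dc in neighbours:
                    nr, nc = r + dr, c + dc
                    if not (0 <= nr < rows and 0 <= nc < cols) or grid[nr][nc] == 1:
                        is_edge = True
                    elif not seen[nr][nc]:
                        seen[nr][nc] = True
                        queue.append((nr, nc))
                if is_edge:
                    edge_tiles.append((r, c))
            rooms.append(edge_tiles)
    return rooms
```

Each tile is enqueued at most once, so the pass is linear in the number of tiles.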

This guarantees connectivity, and it guarantees the absolute minimum number of changed tiles (caveat: this is false if we consider the possibility of connecting a room not to another room, but to a tunnel between another pair of rooms). The issue is that it scales poorly: step (2) scales quadratically in the number of tiles in the grid.

Optimized algorithm

The bottleneck is with step (2). We're meticulously checking every single pair of tiles for every pair of rooms to ensure we get the absolute smallest connection. If we accept a little bit of error (i.e. a suboptimal connection) we can speed it up dramatically. The basic idea is to skip a number of tiles proportional to the distance we just computed: if we compute a large distance between tile A and tile B, then chances are that we're nowhere near an optimal connection, so we can skip checking the nearby tiles. Any error from this skipping will be proportional to the length of the optimal connection.

To explain this visually, suppose X and Y represent a current pair of tiles being checked, and that we're currently looping X over the room on the left.

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
1 X 0 0 0 0 0 0 1 1 1 1 1 1 1 
1 1 0 0 0 0 0 0 0 1 1 1 1 1 1
1 1 0 0 0 0 0 0 1 1 1 1 1 1 1 
1 0 0 0 0 0 0 1 1 1 1 0 0 0 1 
1 1 0 0 0 0 1 1 0 0 0 0 0 1 1 
1 1 1 1 1 1 1 0 0 0 0 Y 1 1 1 
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

These tiles are about 11 apart (X and Y differ by 5 rows and 10 columns, so the Euclidean distance is sqrt(5^2 + 10^2) ≈ 11.2), so let's skip 11 tiles (marked with dashes):

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
1 0 - - - - - - 1 1 1 1 1 1 1 
1 1 0 0 0 0 0 - - 1 1 1 1 1 1
1 1 0 0 0 0 - - 1 1 1 1 1 1 1 
1 0 0 0 0 X - 1 1 1 1 0 0 0 1 
1 1 0 0 0 0 1 1 0 0 0 0 0 1 1 
1 1 1 1 1 1 1 0 0 0 0 Y 1 1 1 
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Most compared pairs of tiles will be far apart, so this dramatically reduces the run time of step 2 and, in practice, produces minimal error.
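To make the skipping idea concrete, here is a minimal Python sketch under some assumptions: each room is given as an ordered list of (row, col) edge tiles, the inner loop advances by the truncated distance it just computed, and the outer loop advances by the smallest distance seen during that inner pass. The function name and the exact stepping rule for the outer loop are choices made for this illustration; other stepping rules are possible.

```python
import math

def closest_pair_with_skipping(room_a, room_b):
    """Approximate the closest pair of tiles between two ordered edge lists.

    Returns (distance, tile_a, tile_b). The result is a heuristic: skipping
    can miss the true closest pair.
    """
    best = (math.inf, None, None)
    i = 0
    while i < len(room_a):
        a = room_a[i]
        row_best = math.inf  # smallest distance seen for this tile of room_a
        j = 0
        while j < len(room_b):
            b = room_b[j]
            d = math.hypot(a[0] - b[0], a[1] - b[1])
            if d < best[0]:
                best = (d, a, b)
            row_best = min(row_best, d)
            # Far apart: skip about d tiles along the edge of room_b.
            j += max(1, int(d))
        # Skip along room_a by the best distance found for this tile.
        i += max(1, int(row_best))
    return best
```

When the rooms are far apart the loops terminate after a handful of comparisons; when they are close the step size shrinks toward one tile and the search becomes nearly exhaustive around the closest region.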

The one issue being overlooked here is that this assumes the edge tiles are ordered: this requires an additional preprocessing step.
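One way to obtain such an ordering is a depth-first walk along the edge tiles. The Python sketch below is an assumption-laden illustration: the name `order_edge_tiles` and the use of 8-neighbour adjacency between edge tiles are choices made here, and the walk can still jump when it backtracks out of a dead end.

```python
def order_edge_tiles(edge_tiles):
    """Return the edge tiles in depth-first visiting order.

    Assumption: two edge tiles are adjacent if they touch, including
    diagonally (8-neighbour adjacency).
    """
    remaining = set(edge_tiles)
    ordered = []
    for start in edge_tiles:
        if start not in remaining:
            continue
        stack = [start]
        while stack:
            tile = stack.pop()
            if tile not in remaining:
                continue  # already visited via another path
            remaining.discard(tile)
            ordered.append(tile)
            r, c = tile
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    nxt = (r + dr, c + dc)
                    if nxt in remaining:
                        stack.append(nxt)
    return ordered
```

Every tile is visited once and pushed at most eight times, so the ordering stays linear in the number of edge tiles.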

Thus here's the optimized algorithm:

1. Perform the BFS as in the previous algorithm.
2. Order the edge tiles of each room. This can be done by running a depth-first search along the edge tiles, which gives a roughly continuous path along the edge of the room. There are a few (literal) corner cases where the path can jump, but the paths are continuous within reasonable approximation.
3. Perform the double loop in step 2 from the previous algorithm, but this time, instead of incrementing by one tile, increment by the last computed distance.
4. Perform Kruskal's algorithm as before. Note that Kruskal's requires that we sort the edges in the graph: since the graph is complete (every pair of rooms has a potential tunnel), a standard comparison sort becomes the new bottleneck in the algorithm. Since we're sorting by distance, which is a float, we can achieve a much faster sort by truncating the floats to integers and using an integer sort. Again, this produces minimal error (Kruskal's might choose a tunnel with distance 4.6 over a tunnel with distance 4.2, for example) but offers a dramatic speedup.
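The truncated sort in step 4 can be sketched as a simple bucket sort. This is only an illustration: the `(weight, u, v)` tuple layout and the function name are assumptions, and the number of buckets is taken from the largest truncated weight, which is bounded by the grid diagonal.

```python
def sort_edges_by_truncated_weight(edges):
    """Bucket-sort (weight, u, v) edges by int(weight).

    Runs in O(E + K) where K is the largest truncated weight. Edges whose
    weights truncate to the same integer keep their input order.
    """
    if not edges:
        return []
    max_key = max(int(weight) for weight, _, _ in edges)
    buckets = [[] for _ in range(max_key + 1)]
    for edge in edges:
        buckets[int(edge[0])].append(edge)
    return [edge for bucket in buckets for edge in bucket]
```

For instance, edges with weights 4.6 and 4.2 land in the same bucket and come out in input order, which is exactly the small error described in step 4.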

Run-time analysis

Let n denote the number of tiles in the grid. Let m denote the number of rooms (connected components) in the grid. Note that in general, in the worst case, m = O(n).

There are four steps in the algorithm. (1) and (2) are both O(n) in memory and time, as they are a BFS and a DFS that each process every tile in the map at most once.

(3) is a bit trickier. We're doing a double loop over every room, and then finding a connection. This is O(m^2) for the double loop, multiplied by the average work done per pair of rooms. Giving a tight analytical bound for the average work done per pair of rooms is not a straightforward matter. But empirically, testing over a large variety of configurations, as n grows large, the average work converges to a single comparison. This is because the average distance between tiles of different rooms grows as the grid grows.

So in total the work done for (3) is O(m^2), with O(1) storage.

For (4), the cost is the runtime of Kruskal's algorithm: O(m^2 log m^2) = O(m^2 log m) to sort the edges naively, plus O(m^2 a(m^2)) to run the edges through the union-find data structure, where a is the inverse Ackermann function (effectively a constant). If we truncate the edge lengths and use an integer sorting algorithm, we can get the sort down to O(m^2). Storage is O(m^2).
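For reference, the union-find part of Kruskal's can be sketched in Python as follows. This is a generic textbook version (union by rank with path halving), not my actual code; rooms are assumed to be numbered 0 to m-1 and edges are assumed to arrive already sorted as `(weight, u, v)` tuples.

```python
def kruskal(num_rooms, sorted_edges):
    """Return the tunnels (weight, u, v) forming a spanning tree of the rooms.

    Assumes sorted_edges is already ordered by non-decreasing weight.
    """
    parent = list(range(num_rooms))
    rank = [0] * num_rooms

    def find(x):
        # Path halving: point each visited node at its grandparent.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tunnels = []
    for weight, u, v in sorted_edges:
        root_u, root_v = find(u), find(v)
        if root_u == root_v:
            continue  # rooms already connected; this tunnel is redundant
        # Union by rank: attach the shallower tree under the deeper one.
        if rank[root_u] < rank[root_v]:
            root_u, root_v = root_v, root_u
        parent[root_v] = root_u
        if rank[root_u] == rank[root_v]:
            rank[root_u] += 1
        tunnels.append((weight, u, v))
        if len(tunnels) == num_rooms - 1:
            break  # spanning tree complete
    return tunnels
```

With m rooms the loop stops after m - 1 tunnels have been accepted, matching the three tunnels carved for the four rooms in the example grid.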

So in total, the runtime is dominated by Kruskal's algorithm, given by O(m^2 a(m^2)), or effectively, O(m^2). Given that m = O(n) in the worst case, this is not very good performance. But we can do one final preprocessing step on the grid to get it down to O(n), which is to limit the number of rooms in an organic way.

Prior to any of the other steps, we can use a flood-fill algorithm to fill in small rooms in linear time: specifically, we can fill in any room with fewer than sqrt(n) tiles. Since the rooms are disjoint, at most n / sqrt(n) = sqrt(n) rooms of size at least sqrt(n) can remain in the grid, so m = O(sqrt(n)) and m^2 = O(n), making the entire algorithm linear in the size of the grid.
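A minimal Python sketch of this preprocessing pass, assuming 4-connected rooms and an in-place update of the grid (the function name is made up for this illustration). The size test uses size * size < n, which is equivalent to size < sqrt(n) without floating point.

```python
from collections import deque

def fill_small_rooms(grid):
    """Turn every room with fewer than sqrt(n) floor tiles into wall, in place."""
    rows, cols = len(grid), len(grid[0])
    n = rows * cols
    seen = [[False] * cols for _ in range(rows)]
    for r0 in range(rows):
        for c0 in range(cols):
            if grid[r0][c0] != 0 or seen[r0][c0]:
                continue
            # Flood fill to collect every tile of this room.
            room = [(r0, c0)]
            seen[r0][c0] = True
            queue = deque(room)
            while queue:
                r, c = queue.popleft()
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    nr, nc = r + dr, c + dc
                    if (0 <= nr < rows and 0 <= nc < cols
                            and grid[nr][nc] == 0 and not seen[nr][nc]):
                        seen[nr][nc] = True
                        room.append((nr, nc))
                        queue.append((nr, nc))
            # size < sqrt(n) is the same as size * size < n.
            if len(room) * len(room) < n:
                for r, c in room:
                    grid[r][c] = 1
    return grid
```

On a 5 by 5 grid (n = 25), for example, a room of 6 tiles survives while a room of 2 tiles is filled in.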

Conclusion

Is it possible to do better than this? Obviously we cannot do asymptotically better than linear time and storage, but in order to achieve those figures, a certain amount of sloppily quantified suboptimality in the tunnel lengths is accepted, and it requires modifying the original grid (namely, putting a bound on the number of rooms).

For shortest connections between rooms, see [this question](http://stackoverflow.com/questions/3700983/what-is-the-fastest-algorithm-to-calculate-the-minimum-distance-between-two-sets) (check room for convexity first => optimization possible). Instead of computing a complete connected graph over all rooms, you might want to build local sub-graphs and build tunnels for those, then build tunnels between sub-graphs... — le_m, Apr 02 '17 at 16:53
