
I am trying to locate millions of points inside a half-dozen polygons. Here's my code:

def find_shape(longitude, latitude):
    if longitude != 0 and latitude != 0:
        point = shapely.geometry.Point(longitude, latitude)
    else:
        return "Unknown"
    for current_shape in all_shapes:
        if current_shape['bounding_box'].contains(point):
            if current_shape['shape'].contains(point):
                return current_shape['properties']['ShapeName']
    return "Unknown"

I've read the other questions that deal with improving performance for point-in-polygon queries with shapely. They suggest R-trees. However, it seems that approach is useful when there are many polygons (36,000 in one question, 100,000 in another) and it is not desirable to loop over them all.

I am already setting up a bounding box, as you can see. Here's my shape setup code:

with fiona.open(SHAPEFILE) as f_col:
    all_shapes = []
    for shapefile_record in f_col:
        current_shape = {}
        current_shape['shape'] = shapely.geometry.asShape(shapefile_record['geometry'])
        minx, miny, maxx, maxy = current_shape['shape'].bounds
        current_shape['bounding_box'] = shapely.geometry.box(minx, miny, maxx, maxy)
        current_shape['properties'] = shapefile_record['properties']
        all_shapes.append(current_shape)

Would it be useful to also check another very simplified version of the shape, namely one made of the largest inscribed rectangle (or maybe triangle)?

Checking the shapely docs, it doesn't seem there's a function for this. Maybe some setting of simplify()? Of course, I always want to make sure the new simplified shape does not extend beyond the bounds of the original shape, so I don't have to call contains() on the actual shape. I also think I want to make the new simplified shape as simple as possible, for speed.

Any other suggestions are appreciated as well. Thanks!

EDIT: While awaiting replies, I have hit on this idea for creating a simplified shape meeting my requirements:

current_shape['simple_shape'] = current_shape['shape'].simplify(.5)
current_shape['simple_shape'] = current_shape['simple_shape'].intersection(current_shape['shape'])

Here's how I use it when testing each point:

if current_shape['simple_shape'].contains(point):
    return current_shape['properties']['ShapeName']
elif current_shape['shape'].contains(point):
    return current_shape['properties']['ShapeName']

This is not perfect, because the shape is not as simple as it could be after doing the necessary intersection(). Nevertheless, this approach has yielded a 60% decrease in processing time. In my tests, the simple polygon is used on 85% of point queries.
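For illustration, here is a minimal sketch of this simplify-then-intersect trick. The polygon coordinates and the tolerance are invented for the example; the key point is that the `intersection()` step is what guarantees the simplified shape never extends beyond the original:

```python
from shapely.geometry import Point, Polygon

# A hypothetical concave polygon; coordinates are made up for illustration.
poly = Polygon([(0, 0), (4, 0), (4, 4), (2, 1.5), (0, 4)])

# Simplify, then clip back to the original, so that a hit on the
# simplified shape is always a hit on the original shape too.
simple = poly.simplify(0.5).intersection(poly)
assert simple.within(poly)

def locate(point):
    # A hit on the (hopefully cheaper) simplified shape is a certain hit...
    if simple.contains(point):
        return True
    # ...but a miss still requires the full test on the original shape.
    return poly.contains(point)
```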

EDIT 2: Another related question over on GIS StackExchange: Python Efficiency — Need suggestions about how to use OGR and Shapely in more efficient way. This deals with 1.5 million points in about 3,000 polygons.

    this helps me a lot. https://snorfalorpagus.net/blog/2016/03/13/splitting-large-polygons-for-faster-intersections/ – Gary Liao Feb 04 '20 at 06:18
  • Thanks @GaryLiao, this "katana" function by Joshua Arnott (split each Polygon in half across its shortest dimension, recursively) is a really delightful idea. if you have an implementation where you show how this is used in an intersection query (point-in-polygon especially) please share it, as the blog doesn't really make clear how to use the resulting split Polygons. – Martin Burch Feb 05 '20 at 10:37
  • By katana, 2 kinds of rectangles are generated: (A) those "within" the polygon, and (B) those that "intersect" the edge of the polygon. It is easy to find the points within a rectangle by indexing the latitude and longitude, so the points within A and B can be found almost instantly. We are sure that points within A are also within the polygon, so the last thing we have to do is sequentially check the points within B. – Gary Liao Feb 11 '20 at 02:25
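To make the classification described in the comment above concrete, here is a hypothetical sketch (the polygon and the rectangles are invented for illustration): a type-A rectangle fully inside the polygon yields instant hits, while a type-B rectangle that crosses the edge only narrows down the candidates, which must still be confirmed with the real polygon:

```python
from shapely.geometry import Point, Polygon, box

# A square with one corner cut out, purely for illustration.
poly = Polygon([(0, 0), (4, 0), (4, 4), (0, 4)]).difference(box(3, 3, 4, 4))

inner = box(0.5, 0.5, 2.5, 2.5)   # type A: entirely inside the polygon
edge = box(2.5, 2.5, 3.5, 3.5)    # type B: crosses the polygon boundary
assert inner.within(poly)
assert edge.intersects(poly) and not edge.within(poly)

def locate(point):
    if inner.contains(point):      # cheap rectangle test, certain hit
        return True
    if edge.contains(point):       # candidate only: confirm on the polygon
        return poly.contains(point)
    return poly.contains(point)    # outside both rectangles: full test
```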

1 Answer


I would use an R-Tree. But I'd insert all your points (and not the polygon's bounding box) into the R-Tree.

Use an R-tree index, for example: http://toblerity.org/rtree/

from rtree import index

idx = index.Index()

# Inserting a point, i.e. where left == right and bottom == top,
# will essentially insert a single point entry into the index.
for current_point in all_points:
    idx.insert(current_point.id,
               (current_point.x, current_point.y, current_point.x, current_point.y))

# Now loop through your polygons. intersection() yields the ids of the
# candidate points inside each polygon's bounding box; you then confirm
# each candidate with an exact contains() test (looking the point's
# geometry back up by its id).
for current_shape in all_shapes:
    for point_id in idx.intersection(current_shape['shape'].bounds):
        if current_shape['shape'].contains(points_by_id[point_id]):
            pass  # now we know that this point is inside the current shape
So you end up with half a dozen queries on your larger R-Tree rather than with millions of queries on a small R-Tree.
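As a minimal, self-contained sketch of that idea (the points, ids, and shapes here are invented; `intersection()` returns the ids of candidate points, which are then confirmed with `contains()`):

```python
from rtree import index
from shapely.geometry import Point, box

# Hypothetical mini dataset: three points and two polygons (plain boxes here).
points = {0: (1, 1), 1: (5, 5), 2: (9, 9)}
shapes = {'a': box(0, 0, 2, 2), 'b': box(4, 4, 10, 10)}

idx = index.Index()
for pid, (x, y) in points.items():
    # A degenerate box (left == right, bottom == top) is a point entry.
    idx.insert(pid, (x, y, x, y))

hits = {name: [] for name in shapes}
for name, shape in shapes.items():
    # One index query per polygon, then an exact test per candidate.
    for pid in idx.intersection(shape.bounds):
        if shape.contains(Point(points[pid])):
            hits[name].append(pid)
```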

Markus Schumann
  • Ah, I think this could reduce the number of comparisons, but I have over 300 million points, so I can't load them all into the index at once. Right now, I'm reading the points sequentially from a file. I'm passing them off to a multiprocessing pool that runs `find_shape()` for each. Also, most of the time, the first polygon compared contains the point. (I've optimized the order of the `all_shapes` array.) – Martin Burch Mar 16 '15 at 18:57
  • I don't think 300 million entries are too much for an index. But if your data is somewhat disjoint or partitioned, then load it into 30 indices with 10 million each and perform 30 * 6 queries. Here is somebody who loaded 10 million points into an R-Tree: https://gist.github.com/acrosby/4601257 – Markus Schumann Mar 16 '15 at 21:50