
I currently have a working script, but I'm running into performance issues.

What I want: get the height of a building.

What I have:

  • A file with the GPS coordinates of the buildings I'm interested in (millions of points, with possible/certain duplicates), all located in the Netherlands.
  • A file with the height of every building in the Netherlands. This is a GML file, but I'm unable to load it properly with networkx, so I've got a workaround that uses xml.etree.ElementTree to handle it as plain XML. You can download it here if you like.

Method: from the GML file, I am able to build a dictionary of every building (roughly 3 million) with the polygon outline coordinates as key and the height as value. For example:

dictHeights = {((157838.00015090595, 461662.000273708), (157838.00015090595, 461662.000273708), (157781.32815085226, 461515.93227361), (157781.32815085226, 461515.93227361), (157781.32815085226, 461515.93227361)): 9.41, ...}
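
For reference, this is a stripped-down sketch of how I build that dictionary with the ElementTree workaround. I've simplified it here, and the file name, tag and namespace strings are only illustrative, so they may not match the real GML exactly:

import xml.etree.ElementTree as ET

GML_NS = '{http://www.opengis.net/gml}'
BLDG_NS = '{http://www.opengis.net/citygml/building/2.0}'

tree = ET.parse('buildings.gml')  # illustrative file name for the downloaded height file
root = tree.getroot()

dictHeights = {}
for building in root.iter(BLDG_NS + 'Building'):
    pos_list = building.find('.//' + GML_NS + 'posList')
    height_el = building.find('.//' + BLDG_NS + 'measuredHeight')
    if pos_list is None or height_el is None:
        continue
    coords = list(map(float, pos_list.text.split()))
    # pair up the x/y values into ((x1, y1), (x2, y2), ...) so the tuple can serve as a dict key
    polygon = tuple(zip(coords[0::2], coords[1::2]))
    dictHeights[polygon] = float(height_el.text)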

I am able to loop through all keys and find the height using the self-made function below, and this works fine. However, since I'm dealing with millions of points (addresses) and polygons (buildings), I run into serious performance issues: getting just 5 heights already takes a few minutes. One solution might be a tree structure, but I can't see how that would reduce the runtime enough, since I would either have to build a huge tree, which takes a lot of time, or the tree is small and then every lookup step is time-consuming. Basically I want to get rid of the big for-loops, if possible; a rough sketch of the grid/tree idea I have in mind is included right after my current code below.

import numpy as np
import matplotlib.path as mpltPath

This function is the bottleneck:

def getMedianHeight(dictHeights, points):
    heights = []
    for point in points:
        found = False
        for key, value in dictHeights.items():
            path = mpltPath.Path(key)
            if path.contains_point(point):
                heights.append(value)
                found = True
                break
        if not found:
            heights.append(-999)
    return heights
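
To make the grid/tree idea concrete, this is roughly what I imagine it could look like. It is just a sketch, not tested on the real data: the cell size is a guess, and it also builds each Path only once instead of once per point per polygon.

from collections import defaultdict

CELL = 500.0  # assumed grid cell size in metres (RD coordinates); would need tuning

def buildIndex(dictHeights, cell=CELL):
    # bucket every polygon (with its prebuilt Path) into each grid cell its bounding box touches
    index = defaultdict(list)
    for polygon, height in dictHeights.items():
        path = mpltPath.Path(polygon)  # build the Path once, up front
        xs = [x for x, y in polygon]
        ys = [y for x, y in polygon]
        for cx in range(int(min(xs) // cell), int(max(xs) // cell) + 1):
            for cy in range(int(min(ys) // cell), int(max(ys) // cell) + 1):
                index[(cx, cy)].append((path, height))
    return index

def getHeight(index, point, cell=CELL):
    # test the point only against the polygons bucketed in its own cell
    for path, height in index.get((int(point[0] // cell), int(point[1] // cell)), ()):
        if path.contains_point(point):
            return height
    return -999

The idea would be that building the index is a one-off pass over the polygons, and every lookup afterwards only touches the buildings in a single cell instead of all 3 million.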

I'm not sure how to create dummy data that replicates my data. My primary source is here. The problem is creating the dict in a proper way.

# random set of points to test
N = 1000  # this should be about 4 million
points = list(zip(np.random.random(N), np.random.random(N)))

# this creates an array of polygons, but they overlap, which they don't in my data
lenpoly = 100 
M = 100 # this should be about 3 million

polygons = tuple(tuple((np.sin(x) + 0.5, np.cos(x) + 0.5) for x in np.linspace(0, 2 * np.pi, lenpoly)[:-1]) for m in np.random.random(M))

#create array of virtual heights
heights = 100*np.random.random_sample(M)

The following line results in a dict with only 1 entry. I've tried about 1000 ways to build the dict the way I want (all polygons as keys, heights as values), but I am not able to do it... Either polygons ends up as a generator, or as M generators, or it is as it should be (like now), but then the dictionary isn't filled properly.

dictHeights = dict((polygon, height) for polygon, height in zip(polygons, heights))

result = getMedianHeight(dictHeights, points)
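
I suspect what I'm after is something where every dummy polygon gets its own centre, so that all the dict keys are distinct; a sketch of that could look like the following (the centres and the 0.01 radius are arbitrary choices, untested at full scale):

# give each of the M dummy polygons its own small random centre so every dict key is distinct
centres = np.random.uniform(0, 1, size=(M, 2))
angles = np.linspace(0, 2 * np.pi, lenpoly)[:-1]
polygons = tuple(tuple((cx + 0.01 * np.sin(a), cy + 0.01 * np.cos(a)) for a in angles)
                 for cx, cy in centres)
heights = 100 * np.random.random_sample(M)
dictHeights = dict(zip(polygons, heights))  # now has M entries instead of 1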

To make an MWE, I'll provide a small bit of the real data:

dictHeights = {((151922.594999999, 601062.109999999), (151915.193, 601067.614999998), (151919.848000001, 601073.874000002), (151927.25, 601068.368999999), (151922.594999999, 601062.109999999)): 9.16, ((151229.125999998, 601124.223999999), (151231.934, 601113.313000001), (151225.774, 601111.728), (151222.965999998, 601122.638999999), (151229.125999998, 601124.223999999)): 7.695}
points = [(157838.00015090595, 461662.000273708), (157838.00015090595, 461662.000273708), (157781.32815085226, 461515.93227361), (157781.32815085226, 461515.93227361), (157781.32815085226, 461515.93227361)]
result = getMedianHeight(dictHeights, points)

Note: the result for this MWE is -999 for all points, since the polygons in the dict are not the correct buildings, but you should get the point :-)

  • Why don't you split your data into a rough tree of degrees (or minutes) of latitude/longitude when loading? You can then select only the intersecting subset of your search point (at most 2x2, unless there is a building that spans more than one degree/minute) and do the collision testing with the buildings only in that subset instead of checking every single building in the whole dataset. That's the roughest optimization you can do but it should improve your performance tremendously, although it will slow down when searching for a point in Utrecht/Amsterdam compared to a field in Friesland. – zwer Aug 15 '18 at 09:58
  • I've thought about this (but I don't know yet how to build the subsets/tree structure), but I think this will not solve the problem: you either have lots of branches/subsets that are small, or you will end up with few large branches/subsets. Having more subsets results in a longer time to find the correct subset, but searching within a subset goes faster. This is a trade-off of course, but finding the optimal balance, if a fast solution exists at all, is going to be a very time-consuming task... So do you have any guidance on how to implement your idea? – JtenBulte Aug 15 '18 at 10:48
  • Have you TRIED to do it by chunks? I think if you write the code you will see it will be MUCH faster – AntiMatterDynamite Aug 15 '18 at 11:33
