I currently have a working script, but I'm running into performance issues.
What I want: get the height of a building.
What I have: A file with GPS coordinates of the buildings that are in my interest (millions, with possible/certain duplicates), all located in the Netherlands A file with the height of every building in the Netherlands. This is a GML file but I am unable to load this properly using networkx, so I've got a workaround with xml.etree.ElementTree to handle it as an XML file. You can download it here if you like.
Method: from the GML file, I am able to build a dictionary of every building (roughly 3 million) that has the polygon outline coordinates as key and the height as value. for example:
dictHeights = {((157838.00015090595, 461662.000273708), (157838.00015090595, 461662.000273708), (157781.32815085226, 461515.93227361), (157781.32815085226, 461515.93227361), (157781.32815085226, 461515.93227361)): 9.41, ...}
I am able to loop through all keys and find the heigth using the following two self-made functions and this works fine, however, since I'm dealing with millions of points (adresses) and polygons (buildings) I run into serious performance issues. Getting only 5 heights already takes a few minutes... One solution might be working with a tree structure, but I can't see how this will reduce the runtime enough, since I will either have to build a huge tree which takes a lot of time, or the tree is small and then all steps are time consuming. Basically I want to get rid of the big for-loops, if possible.
import numpy as np
import matplotlib.path as mpltPath
this function is the bottleneck
def getMedianHeight(dictHeights, points):
heights = []
for point in points:
found = False
for key, value in dictHeights.items():
path = mpltPath.Path(key)
if path.contains_point(point):
heights.append(value)
found = True
break
if not found:
heights.append(-999)
return heights
I'm not so sure how I can create dummy data to replicate my data. My primary source is here. The problem is creating the dict in a proper way.
# random points set of points to test
N = 1000 #this should be about 4 million
points = zip(np.random.random(N),np.random.random(N))
#this creates a an array of polygons,but they overlap, and don't in my data.
lenpoly = 100
M = 100 # this should be about 3 million
polygons = tuple([tuple((np.sin(x)+0.5,np.cos(x)+0.5) for x in np.linspace(np.pi,lenpoly)[:-1]) for m in np.random.random(M)])
#create array of virtual heights
heights = 100*np.random.random_sample(M)
the following line is resulting in a dict with only 1 entry, I've tried about 1000 ways to build the dict in the way I want (all polygons as key, heights as value) but I am not able to do it... Either polygons
is a generator, or M generators, or it is as it should (like now) but then the dictionary isn't working properly.
dictHeights = dict((polygon, height) for polygon, height in zip(polygons, heights))
result = getMedianHeight(dictHeights, points)
To make a MWE, I'll provide a small bit of the real data:
dictHeights = {((151922.594999999, 601062.109999999), (151915.193, 601067.614999998), (151919.848000001, 601073.874000002), (151927.25, 601068.368999999), (151922.594999999, 601062.109999999)): 9.16, ((151229.125999998, 601124.223999999), (151231.934, 601113.313000001), (151225.774, 601111.728), (151222.965999998, 601122.638999999), (151229.125999998, 601124.223999999)): 7.695}
points = [(157838.00015090595, 461662.000273708), (157838.00015090595, 461662.000273708), (157781.32815085226, 461515.93227361), (157781.32815085226, 461515.93227361), (157781.32815085226, 461515.93227361)]
result = getMedianHeight(dictHeights, points)
Note: the result for this MWE is -999 for all points, since the polygons in the dict are not the correct buildings, but you should get the point :-)