I'm building a program in Python that loads the geographic locations of all addresses in my country and then calculates a "remoteness" value for each address based on its distance to all the other addresses. Since there are 3-5 million addresses in my country, I ultimately need to run the index calculation 3-5 million times.
I have already taken some measures to bring down running time, including:
- trying to process the data efficiently using NumPy
- instead of each address looking at its distance to every other address, I divide the country into zones, each of which has already been assigned a population; each address then only needs to know how many zone centers fall within each of the distances enumerated in `RADII` (a minimal sketch of this idea follows the list)
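
To make the second point concrete, here's a minimal sketch of the binning idea; the numbers are purely illustrative, not from my dataset:

```
import numpy as np

# Made-up example: distances (metres) from one address to four zone
# centers, and each zone's population
dists = np.array([500.0, 2000.0, 3000.0, 12000.0])
pops = np.array([120.0, 80.0, 300.0, 45.0])

index = 0.0
for radius in (888, 1480, 2368, 3848, 6216, 10064, 16280):
    within = dists < radius        # zone centers inside this circle
    count = within.sum()           # how many there are
    pop = pops[within].sum()       # their combined population
    if count:                      # average population within the radius
        index += (1 / radius) * (pop / count)
```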
It's still taking 23 seconds to get through a list of 5,000 addresses. Extrapolating linearly that would be about 23 s / 5,000 × 5,000,000 ≈ 6.4 hours, but since the full dataset also contains far more zones (and each address is checked against every zone), I expect 24h+ running time for the full dataset of 3-5 million.
So my question is: which parts of my algorithm should be improved in order to save time? What stands out to you as slow, redundant, or inefficient code? Does my use of JSON and NumPy make sense, for example? Would simplifying the `distance` function help much? Or would I gain a significant advantage by throwing it all out and using a language other than Python?
Any advice would be much appreciated. I'm a newbie programmer so there could easily be issues I'm not aware of.
Here's the code. It's currently working on a limited dataset (one minor island), which makes up about 0.1% of the full dataset. The algorithm starts at the `#Main loop` comment:
```
import json
import math
import numpy as np
from cs50 import SQL  # SQL wrapper (CS50's); db.execute returns rows as dicts

RADII = [888, 1480, 2368, 3848, 6216, 10064, 16280]
R = 6371000  # Earth radius in metres
remoteness = []
db = SQL("sqlite:///samsozones.db")
def main():
    with open("samso.json", "r") as json_data:
        data = json.load(json_data)
    rows = db.execute("SELECT * FROM zones")

    # Establish the number of zones with addresses in them
    ZONES = len(rows)

    # Matrix holding the center coordinates and population of each zone
    zonematrix = np.zeros((ZONES, 3), dtype="float")
    for i, row in enumerate(rows):
        zonematrix[i, :] = row["x"], row["y"], row["population"]

    # Matrix holding the distance from the current address to each zone
    # center plus that zone's population (refilled for every address)
    distances = np.zeros((ZONES, 2), dtype="float")

    # Main loop (calculate remoteness index for each address)
    for address in data:
        # Reset remoteness index for the new address
        index = 0
        # Distance from the address to each zone center, with the population
        for j in range(ZONES):
            distances[j, 0] = distance(address["x"], address["y"],
                                       zonematrix[j, 0], zonematrix[j, 1])
            distances[j, 1] = zonematrix[j, 2]
        # Calculate remoteness index
        for radius in RADII:
            # Count zone centers within this radius and sum their population
            allwithincircle = distances[distances[:, 0] < radius]
            count = len(allwithincircle)
            pop = allwithincircle[:, 1].sum()
            # Average within-radius zone population (0 if the circle is empty);
            # pop is a NumPy scalar, so pop / 0 would give nan rather than
            # raise ZeroDivisionError, hence the explicit check on count
            factor = pop / count if count else 0
            # Increment remoteness index
            index += (1 / radius) * factor
        remoteness.append((address["betegnelse"], index))
# Haversine function by Deduplicator, adapted from
# https://stackoverflow.com/questions/27928/calculate-distance-between-two-latitude-longitude-points-haversine-formula
def distance(lat1, lon1, lat2, lon2):
    dLat = deg2rad(lat2 - lat1)
    dLon = deg2rad(lon2 - lon1)
    a = (math.sin(dLat / 2) * math.sin(dLat / 2)
         + math.cos(deg2rad(lat1)) * math.cos(deg2rad(lat2))
         * math.sin(dLon / 2) * math.sin(dLon / 2))
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    return R * c

def deg2rad(deg):
    return deg * (math.pi / 180)
if __name__ == "__main__":
    main()

# Running time (N = 5070): 23 seconds
# Results quite reasonable
# Scales more or less linearly
```
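
In case it helps frame the "simplify the distance function" question: below is a rough, unbenchmarked sketch of what I imagine a vectorized distance step could look like, computing the distances from one address to all zone centers in a single set of NumPy calls. The function name `distances_to_zones` and its signature are my own invention, not from any library:

```
import numpy as np

def distances_to_zones(lat, lon, zone_lats, zone_lons, r=6371000):
    """Haversine distances (metres) from one point to arrays of zone centers.

    All coordinates are in degrees; zone_lats and zone_lons are 1-D arrays.
    """
    lat1, lon1 = np.radians(lat), np.radians(lon)
    lat2, lon2 = np.radians(zone_lats), np.radians(zone_lons)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * r * np.arctan2(np.sqrt(a), np.sqrt(1 - a))
```

The idea would be to replace the inner `for j in range(ZONES)` loop with one call like `distances[:, 0] = distances_to_zones(address["x"], address["y"], zonematrix[:, 0], zonematrix[:, 1])`, but I don't know whether that loop is actually where the time goes.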