3

I have a set of Users' Latitude and Longitude and a set of Office Location Latitude Longitudes.

I have to find the office location that has the minimum average distance to all the users.

What is an efficient way of doing this in python? I have 3k users and 40k office locations...

For example:

Input: User 1 (x1, y1)
User 2 (x2,y2)
Office 1 (x3,y3)
Office 2 (x4,y4)

I then have to figure out the office location that has the least avg distance from all the users.

Office 1 is 200m from User 1 and 400 m from User 2. Avg distance from all users = 300m

Office 2 is 100m from User 1 and 200m from user 2. Avg distance from all users = 150m

Office 2 is the location of choice.

Eva611
  • 5,964
  • 8
  • 30
  • 38
  • Do you mean you are looking for ONE office location for all users? Or one office location for each user? – Parker Oct 19 '14 at 21:52
  • One office location for all users. Will clarify in question above. – Eva611 Oct 19 '14 at 21:53
  • 3
    Find the centroid of the users, then the nearest office – Peter Wood Oct 19 '14 at 21:58
  • @powerMock edited to show the input. I only have lat, longitudes. – Eva611 Oct 19 '14 at 22:10
  • 3
    The centroid idea is OK. I want to add that all your clients live and all your offices are placed on a curved surface. This fact implies that 1 degree of longitude in Canada is, say 60km and 1 degree of longitude in Mexico is, say 90km. For a relatively small area you can use a map projection and define centroids and distances using map's rectangular coordinates and have a good enough approximation, but if your area is large (continental scale, world scale) you should check carefully your results. – gboffi Oct 19 '14 at 22:28
  • 1
    @PeterWood we were wrong, consider P1=(0,0), P2=(0,0), P3=(12,0). Centroid is C=(4,0), the minimum total distance point (aka Geometric Median) is M=(0,0) as one can verify by inspection. The article en.wikipedia.org/wiki/Geometric_median explains the pitfalls we stepped into, sketches a sound computational approach and gives references to published material. Possibly the centroid may be a good approximation but definitely it's not the solution... – gboffi Oct 21 '14 at 22:51
  • @Eva611 Thank you for asking, you gave me the opportunity of changing my mind... I'd like to thank also the user that 1st doubted of the correctness of the centroid solution but the answer with her/his comments has been deleted, oh well thank you anyway. – gboffi Oct 23 '14 at 07:46

2 Answers2

1

Here's an example using the geodjango portions of django. You can do the same using shapely with pyproj. (These can be a bit of a pain to install, but once you've got everything setup, this kind of work is pretty trivial.)

from django.contrib.gis.geos import Point, MultiPoint

WGS84_SRID = 4326
office1 = Point(x1, y1, srid=WGS84_SRID )
office2 = Point(x2, y1, srid=WGS84_SRID )

# load user locations
user_locations = []
with open("user_locations.csv", "r") as in_f:
    # assuming wgs84 decimal degrees 
    # one location per line in format, 'lon, lat'
    for line in in_f:
        x, y = [float(i.strip()) for i in line.split(",")]
        user_locations.append(Point(x, y, srid=WGS84_SRID ))

# get points in a meters projection
GOOGLE_MAPS_SRID = 3857
office1_meters = office1.transform(GOOGLE_MAPS_SRID, clone=True)
office2_meters = office2.transform(GOOGLE_MAPS_SRID, clone=True)
user_locations_meters = [user_loc.transform(GOOGLE_MAPS_SRID, clone=True) for user_loc in user_locations]

# centroid method
mp = MultiPoint(user_locations, srid=4326)
centroid_distance_from_office1 = mp.centroid.distance(office1_meters)
centroid_distance_from_office2 = mp.centroid.distance(office1_meters)

print "Centroid Location: {}".format(mp.centroid.ewkt)
print("centroid_distance_from_office1: {}m".format(centroid_distance_from_office1)
print("centroid_distance_from_office2: {}m".format(centroid_distance_from_office2)

# average distance method
total_user_locations = float(len(user_locations))
office1_user_avg_distance = sum( user_loc.distance(office1_meters) for user_loc in user_locations_meters)/total_user_locations 
office2_user_avg_distance = sum( user_loc.distance(office2_meters) for user_loc in user_locations_meters)/total_user_locations 

print "avg user distance OFFICE-1: {}".format(office1_user_avg_distance)
print "avg user distance OFFICE-2: {}".format(office2_user_avg_distance)
monkut
  • 42,176
  • 24
  • 124
  • 155
1

Mostly code, implementing the algorithm in http://en.wikipedia.org/wiki/Geometric_median#Computation and gives you an example of use based on a set of random points.

NB: this is for points in a plane, because I can't decide how two spherical coordinates have to be summed... hence you have to map the spherical coordinates with a planar projection beforehand, but this point has been already touched in a previous answer.

code

from math import sqrt
from random import seed, uniform
from operator import add
seed(100)

class Point():
    """Very basic point class, supporting "+", scalar "/" and distances."""
    def __init__(self, x, y):
        self.x = x
        self.y = y
    def __repr__(self):
        return "("+repr(self.x)+","+repr(self.y)+")"
    def __add__(self, P):
        return Point(self.x+P.x, self.y+P.y)
    def __div__(self, scalar):
        return Point(self.x/float(scalar), self.y/float(scalar))
    def delta(self, P):
        dx = self.x - P.x
        dy = self.y - P.y
        return sqrt(dx*dx+dy*dy)

def iterate(GM,points):
    "Simple implementation of http://en.wikipedia.org/wiki/Geometric_median#Computation"
    # distances from the tentative GM
    distances = [GM.delta(p) for p in points]
    normalized_positions = [p/d for p,d in zip(points,distances)]
    normalization_factor = sum(1.0/d for d in distances)
    new_median = reduce(add, normalized_positions)/normalization_factor
    return new_median

# The "clients"
nclients = 10
points = [Point(uniform(-3,3),uniform(-3,3)) for i in range(nclients)]

# Centroid of clients and total of distances
centroid = reduce(add,points)/nclients
print "Position of centroid:",centroid
print "Sum of distances from centroid:",
print reduce(add,[centroid.delta(p) for p in points])


print
print "Compute the Geometric Median using random starting points:"
nstart = 10
for P0 in [Point(uniform(-5,5),uniform(-5,5)) for i in range(nstart)]:
    p0 = P0
    for i in range(10):
        P0 = iterate(P0, points)
    print p0,"-->",P0

print
print "Sum of distances from last Geometric Median:",
print reduce(add,[P0.delta(p) for p in points])

output

Position of centroid: (-0.4647467432024398,0.08675910209912471)
Sum of distances from centroid: 22.846445119

Compute the Geometric Median using random starting points:
(1.2632163919279735,4.633157837008632) --> (-0.8739691868669638,-0.019827884361901298)
(-2.8916600791314986,4.561006461166512) --> (-0.8929310891388812,-0.025857080003665663)
(0.5539966580106901,4.011520429873922) --> (-0.8764828849474395,-0.020607834485528134)
(3.1801819335743033,-3.395781900250662) --> (-0.8550062003820846,-0.014134334529992666)
(1.48542908120573,-3.7590671941155627) --> (-0.8687797019011291,-0.018241177226221747)
(-4.943549141082007,-1.044838193982506) --> (-0.9066276248482427,-0.030440865315529194)
(2.73500702168781,0.6615770729288597) --> (-0.8231318436739281,-0.005320464433689587)
(-3.073593440129266,3.411747144619733) --> (-0.8952513352350909,-0.026600471220747438)
(4.137768422492282,-2.6277493707729596) --> (-0.8471586848200597,-0.011875801531868494)
(-0.5180751681772549,1.377998063140823) --> (-0.8849056106235963,-0.02326386487180884)

Sum of distances from last Geometric Median: 22.7019120091

my own comment

In this case the locations (centroid vs GM) are quite different, but the results are similar. I expect significant differences in both the locations and the mean distances when you have some sort of clustering around a point (a city), or about some features say a line (a road) etc

Eventually, one can speed up things using numpy, I've avoided doing numpy due to limited time resources :)

gboffi
  • 22,939
  • 8
  • 54
  • 85