1

I would like to come up with a faster way to create a distance matrix between all lat lon pairs. This QA addresses doing a vectorized way with standard Linear Algebra, but without Lat Lon coordinates.

In my case these lat longs are farms. Here is my Python code, which for the full data set (4000 (lat, lon)'s) takes at least five minutes. Any ideas?

> def slowdistancematrix(df, distance_calc=True, sparse=False, dlim=100):
    """
    inputs: df

    returns:
    1.) distance between all farms in miles
    2.) distance^2

    """

    from scipy.spatial import distance_matrix
    from geopy.distance import geodesic

    unique_farms = pd.unique(df.pixel)
    df_unique = df.set_index('pixel')
    df_unique = df_unique[~df_unique.index.duplicated(keep='first')] # only keep unique index values
    distance = np.zeros((unique_farms.size,unique_farms.size))

    for i in range(unique_farms.size):
        lat_lon_i = df_unique.Latitude.iloc[i],df_unique.Longitude.iloc[i]
        for j in range(i):
            lat_lon_j = df_unique.Latitude.iloc[j],df_unique.Longitude.iloc[j]
            if distance_calc == True:
                distance[i,j] = geodesic(lat_lon_i, lat_lon_j).miles
                distance[j,i] = distance[i,j] # make use of symmetry

    return distance, np.power(distance, 2)
Bstampe
  • 689
  • 1
  • 6
  • 16
  • 2
    Check out [`geopy`](https://geopy.readthedocs.io/en/stable/#module-geopy.distance), [`geopandas`](http://geopandas.org/) or [`pyproj`](https://github.com/pyproj4/pyproj) for handling coordinates, dealing with great circles, etc. Check out https://stackoverflow.com/questions/31632190/measuring-geographic-distance-with-scipy. Note that you can pass your own distance functions to `scipy.spatial` functions. – Matt Hall Oct 11 '19 at 17:00

2 Answers2

3

My solution is a vectorized version of this implementation:

import numpy as np

def dist(v):
    v = np.radians(v)

    dlat = v[:, 0, np.newaxis] - v[:, 0]
    dlon = v[:, 1, np.newaxis] - v[:, 1]

    a = np.sin(dlat / 2.0) ** 2 + np.cos(v[:, 0]) * np.cos(v[:, 0]) * np.sin(dlon / 2.0) ** 2

    c = 2 * np.arcsin(np.sqrt(a))
    result = 3956 * c

    return result

However you will need to convert your dataframe to a numpy array, using the attribute values. For example:

df = pd.read_csv('some_csv_file.csv')
distances = dist(df[['lat', 'lng']].values)
Ricardo
  • 581
  • 4
  • 11
  • 1
    Thanks for this! It seems I am getting fairly divergent answers after a few hundred miles. I just asked another question here: https://stackoverflow.com/questions/58399897/when-calcuating-distance-between-points-on-earth-why-are-my-haversine-vs-geodes – Bstampe Oct 15 '19 at 17:33
  • The matrix algebra just needed a tweak. ```np.cos[:,0,None]``` should do it. – Bstampe Nov 15 '19 at 17:06
0

This is not a pure python solution, but instead it relies on having r installed with the geodist package and the rpy2 interface:

import rpy2.robjects as ro
from rpy2.robjects.packages import importr
from rpy2.robjects import pandas2ri

from rpy2.robjects.conversion import localconverter


def pygeodist(pd_df):
    """
    pd_df must have columns 'x' and 'y' such that 'x' is the lng coordinate
    and 'y' is the lat coordinate
    """
    geodist=importr('geodist')
    with localconverter(ro.default_converter + pandas2ri.converter):
      return geodist.geodist(pd_df, measure = "geodesic")
Shimi
  • 3
  • 1
  • As it’s currently written, your answer is unclear. Please [edit] to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Feb 28 '22 at 17:42