16

I know that to find the distance between two latitude, longitude points I need to use the haversine function:

def haversine(lon1, lat1, lon2, lat2):
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a)) 
    km = 6367 * c
    return km

I have a DataFrame where one column is latitude and another column is longitude. I want to find out how far these points are from a set point, -56.7213600, 37.2175900. How do I take the values from the DataFrame and put them into the function?

example DataFrame:

     SEAZ     LAT          LON
1    296.40,  58.7312210,  28.3774110  
2    274.72,  56.8148320,  31.2923240
3    192.25,  52.0649880,  35.8018640
4     34.34,  68.8188750,  67.1933670
5    271.05,  56.6699880,  31.6880620
6    131.88,  48.5546220,  49.7827730
7    350.71,  64.7742720,  31.3953780
8    214.44,  53.5192920,  33.8458560
9      1.46,  67.9433740,  38.4842520
10   273.55,  53.3437310,   4.4716664
EdChum
  • 376,765
  • 198
  • 813
  • 562
user3755536
  • 191
  • 1
  • 2
  • 8
  • I've posted an answer that solves your problem but I think it is sub-optimal – EdChum Sep 10 '14 at 14:16
  • For a small df my answer would be OK but for a much larger df you'd be better off converting the degrees to radians and storing this and then performing the calculation on the whole dataframe – EdChum Sep 10 '14 at 14:22
  • I have 143 values in the df and this seems to have worked well. – user3755536 Sep 10 '14 at 14:31
  • OK but generally using `apply` should be a last resort, there are vectorised functions in pandas and numpy that will perform math operations on the whole df, please check the numbers I can't guarantee anything – EdChum Sep 10 '14 at 14:32

1 Answers1

35

I can't confirm if the calculations are correct but the following worked:

In [11]:

from numpy import cos, sin, arcsin, sqrt
from math import radians

def haversine(row):
    lon1 = -56.7213600
    lat1 = 37.2175900
    lon2 = row['LON']
    lat2 = row['LAT']
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * arcsin(sqrt(a)) 
    km = 6367 * c
    return km

df['distance'] = df.apply(lambda row: haversine(row), axis=1)
df
Out[11]:
         SEAZ        LAT        LON     distance
index                                           
1      296.40  58.731221  28.377411  6275.791920
2      274.72  56.814832  31.292324  6509.727368
3      192.25  52.064988  35.801864  6990.144378
4       34.34  68.818875  67.193367  7357.221846
5      271.05  56.669988  31.688062  6538.047542
6      131.88  48.554622  49.782773  8036.968198
7      350.71  64.774272  31.395378  6229.733699
8      214.44  53.519292  33.845856  6801.670843
9        1.46  67.943374  38.484252  6418.754323
10     273.55  53.343731   4.471666  4935.394528

The following code is actually slower on such a small dataframe but I applied it to a 100,000 row df:

In [35]:

%%timeit
df['LAT_rad'], df['LON_rad'] = np.radians(df['LAT']), np.radians(df['LON'])
df['dLON'] = df['LON_rad'] - math.radians(-56.7213600)
df['dLAT'] = df['LAT_rad'] - math.radians(37.2175900)
df['distance'] = 6367 * 2 * np.arcsin(np.sqrt(np.sin(df['dLAT']/2)**2 + math.cos(math.radians(37.2175900)) * np.cos(df['LAT_rad']) * np.sin(df['dLON']/2)**2))

1 loops, best of 3: 17.2 ms per loop

Compared to the apply function which took 4.3s so nearly 250 times quicker, something to note in the future

If we compress all the above in to a one-liner:

In [39]:

%timeit df['distance'] = 6367 * 2 * np.arcsin(np.sqrt(np.sin((np.radians(df['LAT']) - math.radians(37.2175900))/2)**2 + math.cos(math.radians(37.2175900)) * np.cos(np.radians(df['LAT'])) * np.sin((np.radians(df['LON']) - math.radians(-56.7213600))/2)**2))
100 loops, best of 3: 12.6 ms per loop

We observe further speed ups now a factor of ~341 times quicker.

EdChum
  • 376,765
  • 198
  • 813
  • 562
  • Thank you for the optimization comparisons. Good Work. – deepelement Aug 08 '17 at 14:14
  • check the equations from this link: https://www.movable-type.co.uk/scripts/latlong.html – Ahmad Senousi Jan 25 '19 at 04:15
  • You should remove this term ```np.cos(np.radians(df['LAT'])``` when using np.arcsin(area) – Ahmad Senousi Apr 05 '19 at 10:14
  • @AhmadSuliman sorry can you suggest what the full line should be and explain why as it's unclear to me why at the moment – EdChum Apr 05 '19 at 10:16
  • @AhmadSuliman your proposed edit was rejected for some reason, I tried to get your code to run and it produced very different results. Can you propose an edit that is code correct and provide sample and data and results that are verifiable – EdChum Apr 08 '19 at 08:54
  • Thanks for this answer. Insanely fast compared to applying hs.haversine row-wise: 40 seconds on 1.13 million rows for hs.haversine compared to 160ms for this vectorized version. However, the output doesn't match hs.haversine unless you use 6,371 km instead of 6,367. From wikipedia: "A globally-average value [of earth's radius] is usually considered to be 6,371 kilometres (3,959 mi) with a 0.3% variability (±10 km)" (https://en.wikipedia.org/wiki/Earth_radius) – sunday_funday Dec 27 '21 at 16:16