0

Same value is repeating in the dataframe column while calculating the distance between langitude and longitude . I have created a dynamic column for finding the distance between every source to destination but all the values of each column is repeating.

for a,b,x in zip(df.Longitude,df.Latitude,df.index):
    for c,d in zip(df.Longitude,df.Latitude):
        df['distance_'+str(x)]=haversine(a,b,c,d)

enter image description here

suppose this is the dataframe

  index   name   lat     long

    0      a     74.299104  31.481188

    1      b     74.351619  39.481188

    2      c     73.351619  39.481188

now i want the expected result like this

  index   name   lat          long     distanceA      distanceB     distanceC

    0      a     74.299104  31.481188      0             4.5          2.4

    1      b     74.351619  39.481188     5.7             0           5.8

    2      c     73.351619  39.481188     3.8            1.3           0
cs95
  • 379,657
  • 97
  • 704
  • 746
Tayyab Vohra
  • 1,512
  • 3
  • 22
  • 49
  • create a small dataframe and show your expected results. – Scott Boston Feb 18 '20 at 19:31
  • @ScottBoston I am editing the question with the expected result – Tayyab Vohra Feb 18 '20 at 19:32
  • @ScottBoston Please check – Tayyab Vohra Feb 18 '20 at 19:37
  • How many rows you have and you realize the combination for each row against the other can grow quickly depending on the number of rows? If you are trying to make a feature out of the lat lon for some modelling problem then calculating distance from one fixed point against all should give the model same information as calculating against each pair – XXavier Feb 18 '20 at 19:41
  • I just have 330 rows so thats not a problem having all the pairs @XXavier – Tayyab Vohra Feb 18 '20 at 19:41
  • @TayyabGulsherVohra ... what `haversine` function are you using? – Parfait Feb 18 '20 at 20:07
  • Didn’t you already post this question earlier? – AMC Feb 18 '20 at 20:09
  • Please do not share information as images unless absolutely necessary. See: https://meta.stackoverflow.com/questions/303812/discourage-screenshots-of-code-and-or-errors. – AMC Feb 19 '20 at 02:55

3 Answers3

5

Since the result is mirrored along the diagonal, you can cut the iterations in half:

res = pd.DataFrame(index=df['name'], columns=df['name'])

for idx1 in df.index:
    for idx2 in df.index[idx1+1:]:
         name1, lat1, lon1 = df.loc[idx1, ['name', 'lat', 'long']]
         name2, lat2, lon2 = df.loc[idx2, ['name', 'lat', 'long']]

         res.loc[name1, name2] = haversine(lat1, lon1, lat2, lon2)

res = res.add(res.T, fill_value=0)  # res is upper triangular
cs95
  • 379,657
  • 97
  • 704
  • 746
3

You can try something like this using a self-join to get all combinations:

df_m = df.assign(key=1).merge(df.assign(key=1), on='key', suffixes=('','_y'))
df_m['distance'] = df_m.apply(lambda x: haversine((x.lat, x.long),(x.lat_y, x.long_y), 
                                                   unit=Unit.MILES), 
                              axis=1)
df_mat = (df_m.set_index(df.columns.to_list()+['name_y'])['distance']
              .unstack()
              .add_prefix('distance')
              .reset_index())

df.merge(df_mat)

Output:

  name        lat       long   distancea   distanceb   distancec
0    a  74.299104  31.481188    0.000000  149.269583  167.150147
1    b  74.351619  39.481188  149.269583    0.000000   69.093420
2    c  73.351619  39.481188  167.150147   69.093420    0.000000
Scott Boston
  • 147,308
  • 15
  • 139
  • 187
1

Consider a vectorized version of the haversine formula and avoid for or apply loops. For example, assuming you use this version, swap out the scalar functions for numpy versions:

import numpy as np

def haversine(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians 
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])

    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a)) 
    r = 3956 # Radius of earth in miles. Use 6371 for kilometers.

    return c * r

Then, run cross join with pivot_table:

cj_df = (df.assign(key=1)
           .merge(df.assign(key=1), on=['key'], suffixes=['', '_'])
        )           

cj_df['distance'] = haversine(cj_df['long'], cj_df['lat'], 
                              cj_df['long_'], cj_df['lat_'])

cj_df.pivot_table(index=['name', 'lat', 'long'],
                  columns = 'name_',
                  values = 'distance',
                  aggfunc = 'max')

# name_                              a           b           c
# name lat       long                                         
# a    74.299104 31.481188    0.000000  149.165464  167.033556
# b    74.351619 39.481188  149.165464    0.000000   69.045225
# c    73.351619 39.481188  167.033556   69.045225    0.000000
Parfait
  • 104,375
  • 17
  • 94
  • 125