Pandas apply operation on each pair of rows

Question

Same value is repeating in the dataframe column while calculating the distance between langitude and longitude . I have created a dynamic column for finding the distance between every source to destination but all the values of each column is repeating.

for a,b,x in zip(df.Longitude,df.Latitude,df.index):
    for c,d in zip(df.Longitude,df.Latitude):
        df['distance_'+str(x)]=haversine(a,b,c,d)

suppose this is the dataframe

  index   name   lat     long

    0      a     74.299104  31.481188

    1      b     74.351619  39.481188

    2      c     73.351619  39.481188

now i want the expected result like this

  index   name   lat          long     distanceA      distanceB     distanceC

    0      a     74.299104  31.481188      0             4.5          2.4

    1      b     74.351619  39.481188     5.7             0           5.8

    2      c     73.351619  39.481188     3.8            1.3           0

@ScottBoston I am editing the question with the expected result — Tayyab Vohra, Feb 18 '20 at 19:32
How many rows you have and you realize the combination for each row against the other can grow quickly depending on the number of rows? If you are trying to make a feature out of the lat lon for some modelling problem then calculating distance from one fixed point against all should give the model same information as calculating against each pair — XXavier, Feb 18 '20 at 19:41
I just have 330 rows so thats not a problem having all the pairs @XXavier — Tayyab Vohra, Feb 18 '20 at 19:41
@TayyabGulsherVohra ... what `haversine` function are you using? — Parfait, Feb 18 '20 at 20:07
Please do not share information as images unless absolutely necessary. See: https://meta.stackoverflow.com/questions/303812/discourage-screenshots-of-code-and-or-errors. — AMC, Feb 19 '20 at 02:55

score 5 · Answer 1 · answered Feb 18 '20 at 19:54

5

Since the result is mirrored along the diagonal, you can cut the iterations in half:

res = pd.DataFrame(index=df['name'], columns=df['name'])

for idx1 in df.index:
    for idx2 in df.index[idx1+1:]:
         name1, lat1, lon1 = df.loc[idx1, ['name', 'lat', 'long']]
         name2, lat2, lon2 = df.loc[idx2, ['name', 'lat', 'long']]

         res.loc[name1, name2] = haversine(lat1, lon1, lat2, lon2)

res = res.add(res.T, fill_value=0)  # res is upper triangular

answered Feb 18 '20 at 19:54

cs95

379,657
97
704
746

1

As usual. very smart! +1 – Scott Boston Feb 18 '20 at 19:55
@ScottBoston +1'd you too, I could not think of a non-loopy solution, nice effort B) – cs95 Feb 18 '20 at 19:55

score 3 · Accepted Answer · answered Feb 18 '20 at 19:54

You can try something like this using a self-join to get all combinations:

df_m = df.assign(key=1).merge(df.assign(key=1), on='key', suffixes=('','_y'))
df_m['distance'] = df_m.apply(lambda x: haversine((x.lat, x.long),(x.lat_y, x.long_y), 
                                                   unit=Unit.MILES), 
                              axis=1)
df_mat = (df_m.set_index(df.columns.to_list()+['name_y'])['distance']
              .unstack()
              .add_prefix('distance')
              .reset_index())

df.merge(df_mat)

Output:

  name        lat       long   distancea   distanceb   distancec
0    a  74.299104  31.481188    0.000000  149.269583  167.150147
1    b  74.351619  39.481188  149.269583    0.000000   69.093420
2    c  73.351619  39.481188  167.150147   69.093420    0.000000

@Parfait I followed an example of how to use haversine from PyPi. I didn't know any better. — Scott Boston, Feb 18 '20 at 20:10

score 1 · Answer 3 · answered Feb 18 '20 at 20:21

Consider a vectorized version of the haversine formula and avoid for or apply loops. For example, assuming you use this version, swap out the scalar functions for numpy versions:

import numpy as np

def haversine(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians 
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])

    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a)) 
    r = 3956 # Radius of earth in miles. Use 6371 for kilometers.

    return c * r

Then, run cross join with pivot_table:

cj_df = (df.assign(key=1)
           .merge(df.assign(key=1), on=['key'], suffixes=['', '_'])
        )           

cj_df['distance'] = haversine(cj_df['long'], cj_df['lat'], 
                              cj_df['long_'], cj_df['lat_'])

cj_df.pivot_table(index=['name', 'lat', 'long'],
                  columns = 'name_',
                  values = 'distance',
                  aggfunc = 'max')

# name_                              a           b           c
# name lat       long                                         
# a    74.299104 31.481188    0.000000  149.165464  167.033556
# b    74.351619 39.481188  149.165464    0.000000   69.045225
# c    73.351619 39.481188  167.033556   69.045225    0.000000

Pandas apply operation on each pair of rows

3 Answers3