I have the following dataframe:

region   country   val_a   val_b 
reg1     cntr1     0.5      0.7
reg2     cntr1     1        2
reg3     cntr1     2        1.2
reg1     cntr2     3        0.3
reg44    cntr2     0.2      0.7

I want to compare each row of this dataframe with every other row, compute the Euclidean distance between the (val_a, val_b) values of each pair of rows, and create the following dataframe:

source_region    source_country    dest_region    dest_country    distance
reg1             cntr1             reg2           cntr1           0.9
reg1             cntr1             reg3           cntr1           1.5

...

I can do a nested loop to create something like this but is there a more pythonic way to accomplish it? Please note that the distance column values are random in this example. You can use any formula you like to compute the euclidean distance, I just want to get the logic for pairwise comparison correct.

user308827
  • Does this answer your question? [How to efficiently calculate euclidean distance matrix for several timeseries](https://stackoverflow.com/questions/62258519/how-to-efficiently-calculate-euclidean-distance-matrix-for-several-timeseries) – Gene Burinsky Jun 12 '23 at 16:28
  • Hello `user308827`. Was my answer useful or do you still need help doing what you want? – user19077881 Jun 14 '23 at 16:38

1 Answer

You can use combinations from itertools to produce every unordered pair of row indices, then build a new dataframe from those pairs. The distance calculation in the code below is merely illustrative; it is not a real Euclidean distance. The code is written in steps so you can inspect the intermediate results if required.

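import pandas as pd
import numpy as np
from itertools import combinations

# Sample data from the question
df = pd.DataFrame({'region': ['reg1', 'reg2', 'reg3', 'reg1', 'reg44'],
                   'country': ['cntr1', 'cntr1', 'cntr1', 'cntr2', 'cntr2'],
                   'val_a': [0.5, 1, 2, 3, 0.2],
                   'val_b': [0.7, 2, 1.2, 0.3, 0.7]})

# Every unordered pair of row labels
idxs = list(combinations(df.index, 2))
# One output row per pair: source row values followed by dest row values
dfx = pd.DataFrame(np.concatenate((df.loc[i].values, df.loc[j].values)) for i, j in idxs)
dfx.columns = ['s_region', 's_country', 's_1', 's_2', 'd_region', 'd_country', 'd_1', 'd_2']
# Placeholder arithmetic only, not a Euclidean distance
dfx['distance'] = dfx['s_1']-dfx['s_2']+dfx['d_1']-dfx['d_2']
print(dfx)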

which gives:

  s_region s_country  s_1  s_2 d_region d_country  d_1  d_2  distance
0     reg1     cntr1  0.5  0.7     reg2     cntr1  1.0  2.0      -1.2
1     reg1     cntr1  0.5  0.7     reg3     cntr1  2.0  1.2       0.6
2     reg1     cntr1  0.5  0.7     reg1     cntr2  3.0  0.3       2.5
3     reg1     cntr1  0.5  0.7    reg44     cntr2  0.2  0.7      -0.7
4     reg2     cntr1  1.0  2.0     reg3     cntr1  2.0  1.2      -0.2
5     reg2     cntr1  1.0  2.0     reg1     cntr2  3.0  0.3       1.7
6     reg2     cntr1  1.0  2.0    reg44     cntr2  0.2  0.7      -1.5
7     reg3     cntr1  2.0  1.2     reg1     cntr2  3.0  0.3       3.5
8     reg3     cntr1  2.0  1.2    reg44     cntr2  0.2  0.7       0.3
9     reg1     cntr2  3.0  0.3    reg44     cntr2  0.2  0.7       2.2
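
The distance column above is placeholder arithmetic. As a minimal sketch of an actual Euclidean distance over the same pairs, assuming the val_a and val_b columns and the output column names from the question (the helper name pairwise_distances is hypothetical, not part of pandas):

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "region": ["reg1", "reg2", "reg3", "reg1", "reg44"],
    "country": ["cntr1", "cntr1", "cntr1", "cntr2", "cntr2"],
    "val_a": [0.5, 1.0, 2.0, 3.0, 0.2],
    "val_b": [0.7, 2.0, 1.2, 0.3, 0.7],
})


def pairwise_distances(frame):
    """Hypothetical helper: one row per unordered pair of rows of `frame`."""
    # Row positions of the two members of each pair, as parallel arrays
    pairs = list(combinations(range(len(frame)), 2))
    src = np.array([i for i, _ in pairs], dtype=int)
    dst = np.array([j for _, j in pairs], dtype=int)

    # Difference in (val_a, val_b) between source and dest rows
    vals = frame[["val_a", "val_b"]].to_numpy(dtype=float)
    diff = vals[src] - vals[dst]

    region = frame["region"].to_numpy()
    country = frame["country"].to_numpy()
    return pd.DataFrame({
        "source_region": region[src],
        "source_country": country[src],
        "dest_region": region[dst],
        "dest_country": country[dst],
        # Euclidean distance: sqrt(d_a**2 + d_b**2)
        "distance": np.hypot(diff[:, 0], diff[:, 1]),
    })


result = pairwise_distances(df)
print(result)
```

Working on positions rather than index labels means this should behave the same even if the dataframe does not have a default integer index.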

This assumes that source A, dest B is the same as source B, dest A, since both distances will be the same. If you want every ordered source/dest pair, use permutations instead of combinations.
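
To illustrate the difference, here is a small sketch using a stand-in for the five row labels of the example (the variable names are illustrative only): combinations yields each pair once, while permutations yields both orderings.

```python
from itertools import combinations, permutations

labels = range(5)  # stand-in for df.index of the five-row example

unordered = list(combinations(labels, 2))  # (0, 1) appears, (1, 0) does not
ordered = list(permutations(labels, 2))    # both (0, 1) and (1, 0) appear

print(len(unordered))  # 10 pairs: n * (n - 1) / 2
print(len(ordered))    # 20 pairs: n * (n - 1)
```

Swapping permutations in for combinations doubles the number of output rows, so it is only worth doing if the direction of each pair matters downstream.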

user19077881