1

My data frame has 16 x and y coordinates for positions{x1,x2...x16,y1,y2...y16} continuous from 0 to 566(m). I want to calculate Euclidean distance between x1,y1 wrt the remaining 15 coordinates and get the overall average distance. I want to repeat the process for all the 16 coordinates in the data frame and get their average distances from the other points.

My Data frame

import pandas as pd

import plotly.express as px

sample = {'index': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
          'columns': ['X1',
          'X2',
          'X3',
          'X4',
          'X5',
          'X6',
          'X7',
          'X8',
           'X9',
           'X10',
           'X11',
           'X12',
           'X13',
           'X14',
           'X15',
          'X16',
           'Y1',
           'Y2',
          'Y3',
          'Y4',
           'Y5',
           'Y6',
           'Y7',
           'Y8',
           'Y9',
           'Y10',
           'Y11',
           'Y12',
           'Y13',
           'Y14',
           'Y15',
          'Y16'],
          'data': [[500.4677, 278.6497, 47.4062, 417.3653, 551.7083, 401.1797, 0.0, 161.7773, 368.1543, 45.985, 520.4714, 566.0, 219.715, 45.0157, 284.9714, 202.886, 26.0476, 566.0, 0.0, 104.1045, 0.0, 335.7074, 486.5247, 566.0, 566.0, 258.2014, 376.2201, 477.4717, 412.28, 96.7684, 90.1294, 364.7503], 
          [511.751, 104.4383, 566.0, 380.4079, 345.8587, 0.0, 90.5588, 7.2899, 566.0, 566.0, 566.0, 313.0356, 191.8169, 200.6335, 566.0, 383.5991, 566.0, 135.7882, 463.57800000000003, 366.7474, 447.0349, 0.0, 410.16, 246.4883, 0.0, 208.0817, 60.8637, 566.0, 48.4488, 117.445, 411.2962, 566.0], 
         [19.699, 216.4378, 355.296, 67.8151, 518.7256, 72.1572, 222.7933, 223.9242, 566.0, 312.4474, 511.7909, 566.0, 78.0924, 226.1336, 566.0, 465.5424, 0.0, 566.0, 447.2046, 259.9073, 566.0, 320.6664, 418.3566, 351.0215, 354.9378, 566.0, 391.7332, 99.2301, 0.0, 137.3001, 535.2882, 566.0], 
         [0.0, 0.0, 243.342, 0.0, 566.0, 0.0, 198.4878, 0.0, 566.0, 566.0, 509.8984, 285.704, 144.3917, 294.7953, 399.9559, 345.9918, 380.3994, 462.8946, 566.0, 566.0, 0.0, 0.0, 234.3866, 53.6411, 546.4076, 253.0758, 0.0, 53.9287, 396.4652, 0.0, 0.0, 10.9969], 
         [446.6611, 421.0366, 493.1895, 566.0, 516.3773, 566.0, 566.0, 474.81699999999995, 566.0, 0.0, 299.7096, 30.5458, 0.0, 170.8087, 178.8432, 187.1938, 278.2215, 0.0, 551.9664, 566.0, 373.9421, 313.9131, 485.6295, 433.9819, 0.0, 481.3123, 0.0, 566.0, 0.0, 6.0358, 566.0, 465.4973], 
         [398.0662, 0.0, 172.95, 480.6902, 566.0, 566.0, 257.4459, 67.4326, 5.4764, 566.0, 0.0, 62.1571, 143.9684, 566.0, 236.2575, 341.5666, 102.6265, 484.4441, 7.9433, 0.0, 320.2722, 26.7175, 566.0, 540.0883, 349.5919, 213.1047, 0.0, 0.0, 566.0, 566.0, 503.7899, 116.725], 
         [113.7286, 341.2415, 418.0059, 0.0, 566.0, 566.0, 566.0, 0.0, 0.0, 566.0, 68.0914, 463.0868, 140.684, 18.4887, 220.1713, 273.2086, 566.0, 0.0, 270.9991, 503.3479, 0.0, 89.0062, 509.5509, 566.0, 287.7841, 566.0, 334.3373, 54.4844, 0.0, 206.8418, 396.798, 566.0], 
         [0.0, 0.0, 0.0, 557.9507, 0.0, 566.0, 530.5327, 566.0, 461.61400000000003, 495.1422, 243.7463, 515.1805, 337.0039, 149.3637, 432.7603, 375.4085, 216.9741, 566.0, 424.6422, 405.3143, 323.5063, 160.0276, 0.0, 566.0, 566.0, 420.6266, 566.0, 566.0, 58.6311, 255.0382, 0.0, 566.0], 
         [369.2071, 304.0515, 566.0, 164.3125, 552.6091, 557.8421, 437.3143, 395.9587, 417.7882, 0.0, 0.0, 198.0035, 121.6725, 169.035, 225.88400000000001, 12.7346, 0.0, 476.0819, 0.0, 503.4461, 566.0, 447.1827, 53.9382, 566.0, 484.8547, 0.0, 566.0, 566.0, 566.0, 0.0, 362.2357, 269.9263], 
         [39.8478, 231.8411, 176.4778, 0.0, 566.0, 0.0, 566.0, 202.1302, 566.0, 291.1011, 0.0, 511.6552, 566.0, 113.6815, 566.0, 0.0, 512.7862, 177.8265, 85.63799999999999, 566.0, 0.0, 6.2889, 305.5369, 0.0, 566.0, 529.326, 136.871, 502.9318, 117.4366, 130.5829, 168.2698, 376.7318]]}



df = pd.DataFrame(index=sample['index'], columns=sample['columns'], 
data=sample['data'])
df      

                                                                           

I have tried with below math formulae for all the coordinated individually and added all the distanced for the coordinate 1 to get the average distance for coordinate 1.

x1 = df.X1
x1 = df.X2
y1 = df.Y1
y2 = df.Y2

d12 = np.sqrt(np.square( x2 - x1 ) + np.square( y2 - y1 ))
avgd1 = (d12+d13+d14+d15+d16+d17+d18+d19+d110+d111+d112+d113+d114+d115+d116)/16.00

Similarly, I have taken average distances for all the 16 coordinates. But its not an efficient way and the code is getting very lengthy. Also my data set is very huge. So i want to know if there are any other efficient methods for my problem.

desertnaut
  • 57,590
  • 26
  • 140
  • 166
user132605
  • 11
  • 4
  • 1
    Does this answer your question? [Distance calculation between rows in Pandas Dataframe using a distance matrix](https://stackoverflow.com/questions/20303323/distance-calculation-between-rows-in-pandas-dataframe-using-a-distance-matrix) – Trenton McKinney Sep 28 '20 at 04:38

1 Answers1

0

It is possible to compute your distances in quite a concise way, using Numpy functions and its broadcasting feature.

Define a function generating average distances for a row from your source DataFrame:

def avgDist(row):
    xy = row.reshape(2, -1).T
    sqDist = ((xy[:, np.newaxis, :] - xy[np.newaxis, :, :]) ** 2).sum(axis=2)
    return np.sqrt(sqDist).sum(axis=1) / (row.size // 2 - 1)

The source row is a 1-D Numpy array and the result is also a 1-D Numpy array.

Then, to generate the output DataFrame, run:

result = df.join(pd.DataFrame(np.apply_along_axis(avgDist, 1, df.values),
    columns=[ col.replace('X', 'avgd') for col in df.columns[:df.shape[1] // 2] ],
    index=df.index))

To test the above code, I prepared the following DataFrame:

       X1      X2     X15     X16      Y1      Y2     Y15     Y16
0    2.00    4.00    2.00    1.00    0.00    2.00    3.00    1.00
1  500.47  278.65  284.97  202.89   26.05  566.00   90.13  364.75
2  511.75  104.44  566.00  383.60  566.05  135.79  411.30  566.00
3   19.70  216.44  566.00  465.54    0.00  566.00  535.29  566.00
4    0.00    0.00  399.96  345.99  380.40  462.89    0.00   11.00
5  446.66  421.04  178.84  187.19  278.22    0.00  566.00  465.50

based on your picture (4 columns instead of 16), with an extra (first) row, with points so located that is is easy to verify the results.

The result is:

       X1      X2     X15     X16      Y1      Y2     Y15     Y16       avgd1       avgd2      avgd15      avgd16
0    2.00    4.00    2.00    1.00    0.00    2.00    3.00    1.00    2.414214    2.742258    2.490712    2.270853 
1  500.47  278.65  284.97  202.89   26.05  566.00   90.13  364.75  419.806664  424.895850  329.120455  317.505970 
2  511.75  104.44  566.00  383.60  566.05  135.79  411.30  566.00  294.869112  547.618066  313.562415  293.388341 
3   19.70  216.44  566.00  465.54    0.00  566.00  535.29  566.00  694.854626  399.741581  406.931584  358.218465 
4    0.00    0.00  399.96  345.99  380.40  462.89    0.00   11.00  380.196350  421.123916  406.266123  376.780534 
5  446.66  421.04  178.84  187.19  278.22    0.00  566.00  465.50  330.838905  471.992748  369.870561  313.927197 

Details how this code works:

  1. Take the first row from the source DataFrame, as a Numpy array:

     row = df.values[0]
    

    The result is: array([2., 4., 2., 1., 0., 2., 3., 1.]).

  2. Generate the first row of the above result, running avgDist(row). The result is array([2.41421356, 2.74225759, 2.49071198, 2.27085307]).

  3. np.apply_along_axis calls avgDist on each row of your source DataFrame (converted to a Numpy array) and concatenates results as consecutive rows of the output, also as a Numpy array.

  4. columns=[...] convert column names from the first half of df ("X..." columns), replacing "X" with "avgd".

  5. The argument of join is a DataFrame, with columns generated as above and the original index.

  6. The last step is to join both DataFrames (on the index).

The "deeper level" of details - how avgDist works:

  1. xy = ... - take the source row and transform it into a 2-column array. The first column contains x coordinates, and the second - y.

    For the first row of df the result is:

     array([[2., 0.],
            [4., 2.],
            [2., 3.],
            [1., 1.]])
    
  2. xy[:, np.newaxis, :] - xy[np.newaxis, :, :] computes differences in each coordinate between every pair of points. The result is:

     array([[[ 0.,  0.],
             [-2., -2.],
             [ 0., -3.],
             [ 1., -1.]],
    
            [[ 2.,  2.],
             [ 0.,  0.],
             [ 2., -1.],
             [ 3.,  1.]],
    
            [[ 0.,  3.],
             [-2.,  1.],
             [ 0.,  0.],
             [ 1.,  2.]],
    
            [[-1.,  1.],
             [-3., -1.],
             [-1., -2.],
             [ 0.,  0.]]])
    
  3. sqDist = ((...) ** 2).sum(axis=2) computes squares of these differences and sums them for each "point to point" pair. The result is:

     array([[ 0.,  8.,  9.,  2.],
            [ 8.,  0.,  5., 10.],
            [ 9.,  5.,  0.,  5.],
            [ 2., 10.,  5.,  0.]])
    

    No surprise, the diagonal contains zeroes, as they are distances from a point to just the same point.

  4. The returned array is a sum of the above array, for each row, divided by the number of points - 1.

Maybe you should also search the Web for examples of how to use np.newaxis to perform "each with each" computation on a sequence of objects (in this case such an object is a row of xy array, representing a point.

Valdi_Bo
  • 30,023
  • 4
  • 23
  • 41