1

Assume a list of points or nodes. each one of them has x y and z coordinates . the distance between two points i and j equal D(i,j)= sqrt((xi-xj)^2+(yi-yj)^2+(zi-zj)^2). Here I got 400000 data points.

Now, I want to select a set of these nodes which have equal distances between them (the inter-distance is specified previously --> 0.05). Hence the selected points are uniformly distributed.

If run with a while loop, it takes approx 3h to complete the entire data set. Looking for fastest method.

no_rows = len(df)
i = 1
while i < no_rows:
    a1 = df.iloc[i-1, 1]
    a2 = df.iloc[i, 1]
    b1 = df.iloc[i-1, 2]
    b2 = df.iloc[i, 2]
    c1 = df.iloc[i-1, 3]
    c2 = df.iloc[i, 3]
    dist = np.round(((a2-a1)**2+(b2-b1)**2+(c2-c1)**2)**0.5,5)
    df.iloc[i, 6]= dist
    if dist < 0.05000:
                df = df.drop(i)
                df.reset_index(drop = True, inplace = True)
                no_rows = len(df)
                i = i-1
    i+=1
Frodon
  • 3,684
  • 1
  • 16
  • 33
Arun
  • 13
  • 2
  • Hi @Arun, have a look at this approach and see if that works for you: https://stackoverflow.com/questions/1401712/how-can-the-euclidean-distance-be-calculated-with-numpy . I think the answer could be to get your data into an array and then use numpy's vectorised operations to get things sped up. – user6386471 Nov 20 '20 at 08:32

1 Answers1

1

EDIT

An option would be to use directly pandas and merging the dataframe over itself. Something like :

import pandas as pd
import numpy as np

df = pd.DataFrame([
  [131.404866,16.176877,128.120177 ], 
  [131.355045,16.176441,128.115972 ], 
  [131.305224,16.176005,128.111767 ], 
  [131.255403,16.175569,128.107562 ], 
  [131.205582,16.175133,128.103357 ], 
  [131.158858,16.174724,128.099413 ], 
  [131.15576,16.174702,128.09916 ], 
  [131.105928,16.174342,128.095089 ], 
  [131.05988,16.174009,128.091328 ], 
  [131.056094,16.173988,128.09103 ], 
  [131.006249,16.173712,128.087107 ], 
  [130.956404,16.173436,128.083184], 
  ],
  columns=['x', 'y', 'z']
)
df.reset_index(drop=False, inplace=True)

dist = 0.05
df['CROSS'] = 1
df = df.merge(df, on="CROSS")
df.reset_index(drop=True, inplace=True)

df['distance'] = np.round(
    np.sqrt(
      np.square(df['x_x'] - df['x_y'])
      + np.square(df['y_x']-df['y_y'])
      + np.square(df['z_x']-df['z_y'])
    ),
    5
)

#drop values where distances are = 0 (same points)
ix = df[df.distance==0].index 
df.drop(ix, inplace=True)


print('These are all pair of points which are matching the distance', dist)
ix = df[df.distance.astype(float)==dist].index
df.sort_values('distance', inplace=True)
print(df.loc[ix])
print('-'*50)


points = pd.DataFrame(
        df.loc[ix, ['index_x', 'x_x', 'y_x', 'z_x']].values.tolist() 
        + df.loc[ix, ['index_y', 'x_y', 'y_y', 'z_y']].values.tolist(), 
        columns=['index', 'x', 'y', 'z'])
points.drop_duplicates(keep='first', inplace=True)
print('These are all the points which have another at distance', dist)
print(points)

Numpy's function are way faster than any loop and will allow you to treat the whole dataset at the same time.

Another could be to use geopandas (it can be very fast also, but I'm not sure this would be the case here : the fastest method involves pyproj's distance computation (written in C) and I don't think there is any declination in 3D)

tgrandje
  • 2,332
  • 11
  • 33
  • great answer - just a small typo in the last call to np.square. Also, I think that OP might be interested in keeping rows where distance >= `dist`, so it's just a replacement of `==` with `>=` in the assignment to `ix`. – user6386471 Nov 20 '20 at 09:10
  • @user6386471 Right about the typo, thx ! Regarding the distance's operator : yes, this is what is in the code, but not what is in the question... So I'll let him check whichever it is :-) – tgrandje Nov 20 '20 at 09:17
  • Good point! Let's see what @Arun comes back with :D – user6386471 Nov 20 '20 at 09:39
  • above code didn't work.tried both option distance == 0.05 and distance >=0.05. – Arun Nov 20 '20 at 13:45
  • You have to give me a little more informations to help. What are the errors ? – tgrandje Nov 20 '20 at 14:03
  • @Arun The error may be due to the fact I assumed your columns where named "x", "y" and "z". Could you post a sample of your dataframe ? – tgrandje Nov 20 '20 at 14:20
  • Hi tgrangje, I used first 1500 points but the result increments to 153000 points. – Arun Nov 22 '20 at 17:55
  • @Arun : at which step did you get 153000 points ? The cartesian product (ie the merge step) should give you 1500**2 = 2 250 000 points. The rest should be the list of pairs corresponding to your distance criteria. Please post a sample of your dataframe, it will allow to help you way better... – tgrandje Nov 22 '20 at 19:07
  • I've edited the code to include your sample ; I also added the "round" method which wasn't there and might be an explanation why you thought the code was not working (and removed the rows which concerned points beeing compared to themselves) – tgrandje Nov 23 '20 at 10:53
  • Thank you, quick one, provided 12 data points, but the result shows for 18 points. If you skip point 6 and 9 from the data frame, all the data points will be in the interval of 0.05 – Arun Nov 23 '20 at 11:35
  • That's because the dataframe returned gave you pairs of points (the distance is computed against 2 points, right ?). I just edited the answer to be more clear and show you how to capture those points in a new dataframe (which I hope will be closer to what you are waiting for) – tgrandje Nov 23 '20 at 13:32
  • Thank you, if you look at the new data look similar to the original data frame. Result should something look similar to this – Arun Nov 24 '20 at 08:11
  • x y z dist 131.404866 16.176877 128.120177 0 131.355045 16.176441 128.115972 0.05 131.305224 16.176005 128.111767 0.05 131.255403 16.175569 128.107562 0.05 131.205582 16.175133 128.103357 0.05 131.15576 16.174702 128.09916 0.05 131.105928 16.174342 128.095089 0.05 131.056094 16.173988 128.09103 0.05 131.006249 16.173712 128.087107 0.05 130.956404 16.173436 128.083184 0.05 – Arun Nov 24 '20 at 08:20
  • You're right, I made a typo when trying to show you how to gather the points from the dataframe with all the couples. It is corrected now. I added another column "index" to trace the points by their order in the first dataset : feel free to drop it if you don't need it. (For further posts : please post the dataframes the way I did (or at least as a dictionnary) ; posting a list of values is a pain to reuse...) – tgrandje Nov 24 '20 at 08:58