
I am brute-force calculating the shortest distance from one point to many others on a 2D plane, with the data coming from pandas DataFrames via df['column'].to_numpy().

Currently, I am doing this with nested for loops over numpy arrays: fill up a list, take the minimum value of that list, and store that value in another list.

Checking 1,000 points (from df_point) against 25,000 (from df_compare) takes about a minute, which is understandably slow for this brute-force approach. My code is below.

import numpy as np

point_x = df_point['x'].to_numpy()
compare_x = df_compare['x'].to_numpy()
point_y = df_point['y'].to_numpy()
compare_y = df_compare['y'].to_numpy()
dumarr = []
minvals = []

# Brute force: calculate the closest point by using the Pythagorean theorem,
# comparing each point to every other point
for k in range(len(point_x)):
    for i, j in np.nditer([compare_x, compare_y]):
        dumarr.append((point_x[k] - i)**2 + (point_y[k] - j)**2)
    minvals.append(df_compare['point_name'][dumarr.index(min(dumarr))])
    # Clear dummy array (otherwise it will keep appending)
    dumarr = []

This isn't particularly pythonic. Is there a way to do this with vectorization, or at least without nested for loops?

DrakeMurdoch
  • You could use cdist from the scipy library to get a 1k x 25k distance matrix, then use numpy.min on the distance matrix along the appropriate axis to get your array of 1k mins. It will be much faster, assuming you have enough RAM to hold the full distance matrix in memory – sjw Apr 13 '19 at 16:54
  • @thesilkworm can you show an example of that using four arrays instead of two? – DrakeMurdoch Apr 13 '19 at 17:05
  • I assume your 4 arrays are 1d, but it would be good to confirm that (maybe even give some small examples). And don't use `nditer`. `zip(compare_x, compare_y)` is simpler (and faster). – hpaulj Apr 13 '19 at 17:09
  • @DrakeMurdoch - it only works with two arrays, but they can be 2D arrays, as in the example I just posted. – sjw Apr 13 '19 at 17:25

4 Answers


The approach is to create a 1000 x 25000 matrix of squared distances, then find the indices of the row-wise minimums.

# distances for all combinations (1000x25000 matrix)
dum_arr = (point_x[:, None] - compare_x)**2 + (point_y[:, None] - compare_y)**2

# indices of minimums along rows
idx = np.argmin(dum_arr, axis=1)

# Not sure what is needed from the indices; this gets the values
# from the `point_name` column of df_compare using the found indices
min_vals = df_compare['point_name'].iloc[idx]
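As a quick self-contained check of the broadcasting step, here is the same trick on made-up toy coordinates (the array names mirror the question's, but the values are illustrative):

```python
import numpy as np

# Toy stand-ins for the question's arrays (values are made up)
point_x = np.array([1.0, 3.0])
point_y = np.array([2.0, 5.0])
compare_x = np.array([3.0, 8.0, 4.0])
compare_y = np.array([3.0, 5.0, 1.0])

# 2x3 matrix of squared distances via broadcasting
dum_arr = (point_x[:, None] - compare_x)**2 + (point_y[:, None] - compare_y)**2

# index of the nearest compare point for each query point
idx = np.argmin(dum_arr, axis=1)
```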
Gerges

Instead of finding the closest point directly, you could try finding the closest value in the x and y directions separately, then compare the two candidates to see which is actually closer, using the built-in min function as in the top answer to this question:

min(myList, key=lambda x:abs(x-myNumber))

from list of integers, get number closest to a given value

EDIT: Your loop would end up something like this if you do it all in one function call. Also, I'm not sure whether min looping through the compare arrays ends up any faster than your current code:

for k, m in zip(point_x, point_y):
    closest = min(zip(compare_x, compare_y), key=lambda p: (p[0] - k)**2 + (p[1] - m)**2)

Another alternative could be to pre-compute the distance from (0,0) or another point like (-1000,1000) for all the points in the compare array, sort the compare array based on that, then only check points with a similar distance from the reference.
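A rough sketch of that pruning idea, with made-up points; the window size is a tuning assumption and must be large enough not to miss the true nearest neighbour (by the triangle inequality, any point within distance d of the query has a reference distance within d of the query's):

```python
import numpy as np

# Made-up compare points
compare_x = np.array([3.0, 8.0, 4.0, 2.0, 8.0])
compare_y = np.array([2.0, 5.0, 1.0, 2.0, 9.0])

# Pre-compute each compare point's distance from the reference (0, 0) and sort by it
ref_dist = np.sqrt(compare_x**2 + compare_y**2)
order = np.argsort(ref_dist)

# For a query point, only examine compare points whose reference distance
# falls within a window around the query's own reference distance
qx, qy = 1.0, 2.0
q_dist = np.sqrt(qx**2 + qy**2)
window = 3.0  # assumed tuning parameter; too small may miss the true nearest point
lo = np.searchsorted(ref_dist[order], q_dist - window)
hi = np.searchsorted(ref_dist[order], q_dist + window)
cand = order[lo:hi]

# Brute force only over the surviving candidates
best = cand[np.argmin((compare_x[cand] - qx)**2 + (compare_y[cand] - qy)**2)]
```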

Matthew K
  • The problem here is that I need to look at the magnitude of the distances because there are cases where you won't get the right answer looking at each coordinate individually. – DrakeMurdoch Apr 13 '19 at 17:03

I'm gonna give you the approach:

  1. Create a DataFrame with columns pointID, CoordX, CoordY
  2. Create a secondary DataFrame that is the first shifted by an offset of 1 (newDF.iloc[pointIDx] = oldDF.iloc[pointIDx - 1])
  3. Loop this offset value from 1 to the number of coordinates - 1
  4. tempDF["Euclid Dist"] = np.sqrt(np.square(oldDF["CoordX"] - newDF["CoordX"]) + np.square(oldDF["CoordY"] - newDF["CoordY"]))
  5. Append this tempDF to a list

Reasons why this will be faster:

  1. Only one loop, iterating the offset from 1 to the number of coordinates - 1
  2. Vectorization is taken care of by step 4
  3. numpy's square-root and square functions are used for speed
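The steps above might be sketched roughly like this, assuming a single DataFrame with the named columns and using np.roll for the offset (an interpretation of the steps, not the author's exact code):

```python
import numpy as np
import pandas as pd

# Toy DataFrame with the columns named in the steps above
df = pd.DataFrame({'pointID': [0, 1, 2],
                   'CoordX': [1.0, 3.0, 4.0],
                   'CoordY': [2.0, 5.0, 7.0]})

dist_frames = []
# Single loop over offsets, from 1 to number of coordinates - 1
for offset in range(1, len(df)):
    # Secondary DataFrame: the original rows rotated by `offset`
    shifted = df.reindex(np.roll(df.index, offset)).reset_index(drop=True)
    # Step 4: fully vectorised Euclidean distance for this offset
    tempDF = pd.DataFrame({
        'from': df['pointID'],
        'to': shifted['pointID'],
        'Euclid Dist': np.sqrt(np.square(df['CoordX'] - shifted['CoordX'])
                               + np.square(df['CoordY'] - shifted['CoordY']))
    })
    # Step 5: append this tempDF to a list
    dist_frames.append(tempDF)

all_dists = pd.concat(dist_frames, ignore_index=True)
```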
Kaustubh J

Here’s an example using scipy cdist, which is ideal for this type of problem:

import numpy as np
from scipy.spatial.distance import cdist

point = np.array([[1, 2], [3, 5], [4, 7]])
compare = np.array([[3, 2], [8, 5], [4, 1], [2, 2], [8, 9]])

# create 3x5 distance matrix
dm = cdist(point, compare)
# get row-wise mins
mins = dm.min(axis=1)
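If the indices (rather than just the distances) are needed, for example to look up point_name as in the question, argmin on the same matrix gives them; the commented df_compare lookup line is an assumption about the question's data:

```python
import numpy as np
from scipy.spatial.distance import cdist

point = np.array([[1, 2], [3, 5], [4, 7]])
compare = np.array([[3, 2], [8, 5], [4, 1], [2, 2], [8, 9]])

# 3x5 distance matrix, as above
dm = cdist(point, compare)

# index of the nearest compare point for each row
idx = dm.argmin(axis=1)

# names = df_compare['point_name'].iloc[idx]  # map back to names as in the question
```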
sjw