
I have two .csv files of 3D points (numeric coordinate data) with associated attribute data (strings + numeric). I need to calculate the Euclidean distance between each point in one file and every point in the other, and keep the attribute data for both points associated with each distance. I have a method that works for this, but it uses a nested loop and I'm hoping there is a less resource-intensive way to do it. Here is the code I am using currently:

import pandas as pd
import numpy as np

# read .csv (file paths elided)
dataset_1 = pd.read_csv(dataset1_path)
dataset_2 = pd.read_csv(dataset2_path)

# convert to numpy array
array_1 = dataset_1.to_numpy()
array_2 = dataset_2.to_numpy()

# define data types for new array. This includes the attribute data I want to maintain
data_type = np.dtype('f4, f4, f4, U10, U10, f4, f4, f4, U10, U10, U10, f4, f4, U10, U100')

#define the new array
new_array = np.empty((len(array_1)*len(array_2)), dtype=data_type)

# calculate the Euclidean distance between each pair of 3D coordinates, and
# populate the new array with the results plus data from the input arrays
number3 = 0
for number in range(len(array_1)):
    for number2 in range(len(array_2)):
        Euclidean_Dist = np.linalg.norm(array_1[number, 0:3] - array_2[number2, 0:3])
        new_array[number3] = (array_1[number, 0], array_1[number, 1], array_1[number, 2], array_1[number, 3], array_1[number, 7],
                              array_2[number2, 0], array_2[number2, 1], array_2[number2, 2], array_2[number2, 3], array_2[number2, 6], array_2[number2, 7],
                              array_2[number2, 12], array_2[number2, 13], Euclidean_Dist,
                              ''.join(sorted(str(array_2[number2, 0]) + str(array_2[number2, 1]) + str(array_2[number2, 2]) + str(array_2[number2, 3]))))
        number3 += 1

#Convert results to pandas dataframe
new_df = pd.DataFrame(new_array)

I work with very large datasets, so if anyone could suggest a more efficient way to do this I would be very grateful.

Thanks,

The code presented above works for my problem; I'm just looking for a more efficient approach.

Edit to show example input datasets (dataset_1 & dataset_2) and desired output dataset (new_df). The key is that for the output dataset I need to maintain the attributes from the input dataset associated with the Euclidean Distance. I could use scipy.spatial.distance.cdist to calculate the distances, but I'm not sure of the best way to maintain the attributes from the input data in the output data.
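For reference, one way to pair `cdist` distances with the original rows (a sketch only; the column names here are made up, not the real ones from my files) is to repeat the first table's rows and tile the second's, matching the row-major order in which `cdist` flattens:

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

# Toy stand-ins for the real CSVs (column names are illustrative only)
ds1 = pd.DataFrame({'X': [0.0, 1.0], 'Y': [0.0, 0.0], 'Z': [0.0, 0.0],
                    'name1': ['a', 'b']})
ds2 = pd.DataFrame({'X': [0.0, 3.0, 0.0], 'Y': [0.0, 4.0, 2.0], 'Z': [0.0, 0.0, 0.0],
                    'name2': ['p', 'q', 'r']})

# Distance matrix (rows of ds1 x rows of ds2), flattened row-major,
# so it lines up with repeating ds1's rows and tiling ds2's rows
dist = cdist(ds1[['X', 'Y', 'Z']], ds2[['X', 'Y', 'Z']]).ravel()

left = ds1.loc[ds1.index.repeat(len(ds2))].reset_index(drop=True)
right = pd.concat([ds2] * len(ds1), ignore_index=True)
out = pd.concat([left.add_suffix('_1'), right.add_suffix('_2')], axis=1)
out['Dist'] = dist
```

This keeps every attribute column from both inputs alongside each distance, without a Python loop.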

[screenshot of example `dataset_1`, `dataset_2` and `new_df` omitted]

COIh0rp
  • Can you add to your question the first 5 rows of `dataset_1` and `dataset_2`, please? And `new_df`? – Khaled DELLAL Nov 18 '22 at 08:24
  • Does this answer your question? [How to do n-D distance and nearest neighbor calculations on numpy arrays](https://stackoverflow.com/questions/52366421/how-to-do-n-d-distance-and-nearest-neighbor-calculations-on-numpy-arrays) – Daniel F Nov 18 '22 at 08:56
  • I've edited to show example input datasets and output datasets. I think scipy.spatial.distance.cdist may be the best way to calculate the distances, but I'm not sure about the best way to link these to the input data as shown in new_df – COIh0rp Nov 18 '22 at 09:39
  • I'm not sure what the difference is between your `new_df` and the result of a simple `.join()` operation. There doesn't seem to be any distance results there. – Daniel F Nov 18 '22 at 09:43
  • The distance result is in the Dist column as the Euclidean Distance. This needs to be calculated from every point in Dataset_1 to every point in Dataset_2 with the attributes from the point data preserved in the new_df – COIh0rp Nov 18 '22 at 09:46
  • The current `data_type` is not efficient. Arrays of structures (AoS) are known to be inefficient, especially when some fields are pretty large and others could benefit from vectorization. See [this post](https://stackoverflow.com/questions/71101579). This is especially important here since Numpy does not implement structured types efficiently. On top of that, you use Unicode strings, which are also known to be slow to compute on. Consider using ASCII ones if you know they contain only ASCII characters. Also note that Numpy pre-reserves the space for strings, here 160 32-bit characters per row, so `640*X*Y` bytes. – Jérôme Richard Nov 18 '22 at 17:04
  • Finally, computing all the distances is a brute-force method, and is generally not strictly required. In many cases you can use a KD-tree, quad-tree or ball-tree to avoid computing all the distances. This results in `O(n log n)` complexity rather than `O(n²)`. All of this can result in orders-of-magnitude faster code for sufficiently large data (assuming the vectorization is done correctly). – Jérôme Richard Nov 18 '22 at 17:12

1 Answer


Two methods. Setup:

import numpy as np
import pandas as pd
import string
from scipy.spatial.distance import cdist

upper = list(string.ascii_uppercase)
lower = list(string.ascii_lowercase)

df1 = pd.DataFrame(np.random.rand(26,3), 
                   columns = lower[-3:], 
                   index = lower )

df2 = pd.DataFrame(np.random.rand(25,3), 
                   columns = lower[-3:], 
                   index = upper[:-1] )  #testing different lengths

Using `.merge(..., how='cross')`, which I think gives your intended output:

new_df = df1.reset_index().merge(df2.reset_index(), 
                              how = 'cross',
                              suffixes = ['1', '2'])
new_df['dist'] = cdist(df1, df2).flatten()
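A quick sanity check (repeating the setup above so it is self-contained) that the flattened `cdist` output lines up row-for-row with the cross merge: `cdist` flattens row-major, which matches the cross-merge order of every `df2` row for the first `df1` row, then every `df2` row for the second, and so on.

```python
import numpy as np
import pandas as pd
import string
from scipy.spatial.distance import cdist

upper = list(string.ascii_uppercase)
lower = list(string.ascii_lowercase)

df1 = pd.DataFrame(np.random.rand(26, 3), columns=lower[-3:], index=lower)
df2 = pd.DataFrame(np.random.rand(25, 3), columns=lower[-3:], index=upper[:-1])

new_df = df1.reset_index().merge(df2.reset_index(),
                                 how='cross', suffixes=['1', '2'])
new_df['dist'] = cdist(df1, df2).flatten()

# Pick an arbitrary flat position and recompute its distance directly:
# flat position k pairs df1 row k // len(df2) with df2 row k % len(df2)
k = 40
expected = np.linalg.norm(df1.iloc[k // len(df2)] - df2.iloc[k % len(df2)])
```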

A 2D 'ravelled' method that maintains the original data as MultiIndexes:

new_df2 = pd.DataFrame(cdist(df1, df2), 
                   index = pd.MultiIndex.from_arrays(df1.reset_index().values.T, 
                                                     names = df1.reset_index().columns), 
                   columns = pd.MultiIndex.from_arrays(df2.reset_index().values.T, 
                                                     names = df2.reset_index().columns))
Daniel F
  • Got the first method working with one minor modification: `new_df['dist'] = cdist(ds1.loc[:, :'Z1'], ds2.loc[:, :'Z2']).flatten()`, as the `cdist` function needed to reference only the coordinate columns – COIh0rp Nov 18 '22 at 10:26
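Spelling out that modification as a sketch (the `X1`/`Y1`/`Z1`-style column names and `attr` columns are placeholders, not the real ones): when the frames carry attribute columns alongside the coordinates, `cdist` must be given only the coordinate columns, since it cannot handle the string columns.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

# Hypothetical frames mixing coordinates with attribute columns
ds1 = pd.DataFrame({'X1': [0.0, 1.0], 'Y1': [0.0, 1.0], 'Z1': [0.0, 1.0],
                    'attr': ['a', 'b']})
ds2 = pd.DataFrame({'X2': [1.0, 0.0], 'Y2': [0.0, 0.0], 'Z2': [0.0, 2.0],
                    'attr': ['c', 'd']})

new_df = ds1.merge(ds2, how='cross', suffixes=['_1', '_2'])
# Restrict cdist to the coordinate columns; passing the whole frame
# would fail on the string 'attr' column
new_df['dist'] = cdist(ds1[['X1', 'Y1', 'Z1']],
                       ds2[['X2', 'Y2', 'Z2']]).flatten()
```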