
I am trying to find the Euclidean distance between two points. I have a DataFrame with around 13000 rows, and I have to compute the Euclidean distance of each row against all 13000 rows and then get similarity scores from that. Running the code takes a very long time (more than 24 hrs).

Below is my code:

import pandas as pd
from tqdm import tqdm

# Start with an empty dataframe to collect the results
df_similar = pd.DataFrame()
print(df_similar)

# 'i' refers to all ids in the dataframe
# Length of df_distance is 13000

for i in tqdm(range(len(df_distance))):
    df_50 = pd.DataFrame(columns=['id', 'id_match', 'similarity_distance'])

    # To avoid duplicates, "index" is set to "i" each time, so that the
    # comparison for this row starts from the index of "i" itself.
    if i < len(df_distance):
        index = i

    # This loop compares one id against all 13000 ids. This is the time-consuming
    # part, as each id has to be compared with all 13000 ids.
    for j in (range(len(df_distance))):

        # "a" is the id we are comparing with
        a = df_distance.iloc[i,2:]        

        # "b" is the id we are selecting to compare with
        b = df_distance.iloc[index,2:]

        value = euclidean_dist(a,b)

        # Create a temp dictionary to load the result into the dataframe
        row = {
            'id': df_distance['id'][i],
            'id_match': df_distance['id'][index],
            'similarity_distance': value
        }

        df_50 = df_50.append(row, ignore_index=True)

        # Once "index" reaches the last row, reset it to 0 so that the
        # comparison of "b" with "a" wraps around to the start.
        if index == len(df_distance)-1:
            index = 0
        else:
            index +=1

    # Append the content of "df_50" to "df_similar" once per iteration of "i"
    df_similar = df_similar.append(df_50,ignore_index=True)

I guess most of the time is being spent in the for loops.

This is the Euclidean distance function I am using in my code:

from sklearn.metrics.pairwise import euclidean_distances

def euclidean_dist(a, b):
    euclidean_val = euclidean_distances([a, b])
    value = euclidean_val[0][1]
    return value
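
For example, with two small made-up rows it returns the full 2x2 distance matrix and I take the off-diagonal entry:

a = [1, 2, 3]
b = [4, 5, 6]
print(euclidean_distances([a, b]))
# [[0.         5.19615242]
#  [5.19615242 0.        ]]
print(euclidean_dist(a, b))  # 5.196152422706632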

Sample df_distance data (the screenshot is not available). Note: in the image the values are scaled from the column locality onward, and only those values are used to calculate the distance.
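
Since the screenshot is missing, a rough sketch of the structure (column names other than id and locality, and all values, are made up for illustration):

# Hypothetical layout only -- the real data is in the missing screenshot.
# The first two columns are skipped by iloc[:, 2:]; the scaled features start at 'locality'.
df_distance = pd.DataFrame({
    'id':       [101, 102, 103],
    'name':     ['a', 'b', 'c'],
    'locality': [0.12, 0.85, 0.40],
    'col_3':    [0.55, 0.10, 0.95],
    'col_4':    [0.33, 0.76, 0.20],
})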

The output should be in the format below (the screenshot is not available).
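
Based on the columns built in the code above (id, id_match, similarity_distance), the layout is roughly (values illustrative):

   id  id_match  similarity_distance
  101       101                0.000
  101       102                0.834
  101       103                1.270
  102       102                0.000
  ...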

  • I think this may answer your question. https://stackoverflow.com/questions/32946241/scipy-pdist-on-a-pandas-dataframe – rayryeng Apr 24 '22 at 14:25
  • what is `tqdm`? – Stuart Apr 24 '22 at 14:32
  • can you show some example data in `df_distance` and the `euclidean_dist` function? – Stuart Apr 24 '22 at 14:36
  • @Stuart tqdm is used to display progress bar in jupyter notebook. – Mahsaga Apr 24 '22 at 16:56
  • @Stuart I have added the snapshot in the question and euclidean_dist function is also added there now. Thanks! – Mahsaga Apr 24 '22 at 17:01
  • 1) is there a maximum distance beyond which you can discard the result? (this can make things much faster) 2) Do you want a full 13000x13000 distance matrix, or is the (13000x12999)/2 upper triangle ok (since distance from i to j is same as distance from j to i)? 3) `numpy` only or can you use other packages? – Daniel F Apr 25 '22 at 06:18
  • You might find [my answer here](https://stackoverflow.com/questions/52366421/how-to-do-n-d-distance-and-nearest-neighbor-calculations-on-numpy-arrays/52366706#52366706) useful in any case. – Daniel F Apr 25 '22 at 07:27
  • @DanielF thanks for your reply. 1) I want it to run for all 13000 first then sort the similarities in ascending order and take the top 50 from that. 2) I just want a value to be returned not required matrix. 3) yeah other packages are also ok – Mahsaga Apr 25 '22 at 09:50
  • How many columns are there? 50 nearest neighbors should be OK to use `KDTree`, but they are only efficient for `2**columns < rows`. In this case, you'd need <13 columns – Daniel F Apr 25 '22 at 10:38

2 Answers


Try using numpy instead; do something like this:

import pandas as pd
import numpy as np 

def numpy_euclidean_distance(point_1, point_2):
    array_1, array_2 = np.array(point_1), np.array(point_2)
    squared_distance = np.sum(np.square(array_1 - array_2))
    distance = np.sqrt(squared_distance)
    return distance 
    
    
# Initialise data as lists
data = {'num1':[1, 2, 3, 4], 'num2':[20, 21, 19, 18]}

# Create DataFrame
df = pd.DataFrame(data)

# Calculate the distance between the two whole columns at once using numpy
distance = numpy_euclidean_distance(df.iloc[:, 0], df.iloc[:, 1])
print(distance)
  • This will still require looping through. – Onyambu Apr 24 '22 at 14:59
  • The problem is that OP has 13000 rows. Your code is to calculate the distance between row1 and row 2. Not between every row to every other row as OP requested – Onyambu Apr 24 '22 at 15:16
  • the Answer provided below might solve the issue – Onyambu Apr 24 '22 at 15:17
  • @onyambu yes you right the problem for me is in calculating it for 13000 rows – Mahsaga Apr 24 '22 at 17:10
  • @Mahsaga your code can be optimized to calculate less than half of what you did. You should know that the distance between two points a and b is the same as the distance between b and a, so you only need to calculate one of them. – bara-elba Apr 26 '22 at 11:16

OK, so from the comments I take it you want the top 50 distances, which is faster as a single step using KDTree. As a warning, KDTree is only faster than brute force when 2**columns < rows, so if you have more than about 13 feature columns there may be faster ways to implement this, but this will still likely be the simplest:

from scipy.spatial import KDTree
X = df_distance.values  # use only the scaled feature columns here, e.g. df_distance.iloc[:, 2:].values
X_tree = KDTree(X)
k_d, k_i = X_tree.query(X, k = 50)  # shape of each is (13k, 50)

k_i[i] will then be a list of indices of the 50 closest points to the point at index i with 0 <= i < 13000, and k_d[i] will be the corresponding distances.
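
For example, to pull out the ids and distances of the 50 nearest neighbours of the first row (note that each point's nearest neighbour is itself, at distance 0):

nearest_ids = df_distance['id'].values[k_i[0]]  # ids of the 50 closest points to row 0
nearest_dist = k_d[0]                           # matching distances, in increasing order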

EDIT: this should get you a dataframe of the distances, with one row per id and one column per matched id:

df_d = {
    idx: {
        df_distance['id'][k_i[i, j]]: d for j, d in enumerate(k_d[i])
    } for i, idx in enumerate(df_distance['id'])
}
out = pd.DataFrame(df_d).T
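
If you want the long format from the question instead (columns id, id_match and similarity_distance), the same k_i / k_d arrays can be flattened directly; a sketch, assuming the arrays and df_distance from above:

import numpy as np

ids = df_distance['id'].values
df_similar = pd.DataFrame({
    'id': np.repeat(ids, k_i.shape[1]),   # each id repeated once per neighbour
    'id_match': ids[k_i.ravel()],         # ids of the matched neighbours
    'similarity_distance': k_d.ravel(),   # corresponding distances
})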
  • thanks for your reply but I am not sure how to get the Output as in required format I have mentioned in question above. To get output in that format I am using nested for loops which is time consuming – Mahsaga Apr 25 '22 at 15:36
  • I think I have generated a dataframe, but haven't the time to test now. – Daniel F Apr 26 '22 at 08:54