I have a number of pickled pandas DataFrames, each with a decent number of rows (~10k). One of the columns holds a NumPy ndarray of floats. (Yes, I specifically chose to store array data inside a single cell. I've read this may not usually be the right way to go, e.g. here, but in this case the individual values are meaningless; only the full list of values has meaning taken together, so I think it makes sense here.) I need to calculate the Euclidean distance between each pair of rows in the frame. I have working code for this, but I'm hoping to improve its performance: right now it estimates that my smaller dataset will take more than a month, and I'm pretty sure it will exhaust my memory long before then.
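For concreteness, a minimal sketch of the layout I'm describing (a toy frame with made-up index labels; the real frames have ~10k rows, but the column name 'val' matches the code below):

```python
import numpy as np
import pandas as pd

# Toy version of the frame: each cell of 'val' holds a whole
# equal-length float ndarray, not a single scalar.
df = pd.DataFrame(
    {'val': [np.array([0.0, 1.0, 2.0]), np.array([3.0, 4.0, 5.0])]},
    index=[101, 102],
)

row_vector = df.loc[101, 'val']  # the entire array lives in one cell
```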
The code is as follows:
import pandas as pd
import sys
import getopt
import math
from scipy.spatial import distance
from timeit import default_timer as timer
from datetime import timedelta

id_column_1 = 'id1'
id_column_2 = 'id2'
distance_column = 'distance'
val_column = 'val'

# where n is the size of the set
# and k is the number of elements per combination
def combination_count(n, k):
    if k > n:
        return 0
    else:
        # n! / (k! * (n - k)!), using integer division to avoid
        # overflowing a float for large n
        return math.factorial(n) // (math.factorial(k) * math.factorial(n - k))

def progress(start, current, total, id1, id2):
    if current == 0:
        print('Processing combination #%d of #%d, (%d, %d)' % (current, total, id1, id2))
    else:
        percent_complete = 100 * float(current) / float(total)
        elapsed_time = timer() - start
        avg_time = elapsed_time / current
        remaining = total - current
        remaining_time = timedelta(seconds=remaining * avg_time)
        print('Processing combination #%d of #%d, (%d, %d). %.2f%% complete, ~%.2f s/combination, ~%s remaining' % (current, total, id1, id2, percent_complete, avg_time, remaining_time))

def check_distances(df):
    indexes = df.index
    total_combinations = combination_count(len(indexes), 2)
    current_combination = 0
    print('There are %d possible inter-message relationships to compute' % total_combinations)
    distances = pd.DataFrame(columns=[id_column_1, id_column_2, distance_column])
    distances.set_index([id_column_1, id_column_2], inplace=True)
    start = timer()
    for id1 in indexes:
        for id2 in indexes:
            # only process each unordered pair once (id1 < id2)
            if id1 >= id2:
                continue
            progress(start, current_combination, total_combinations, id1, id2)
            distances.loc[(id1, id2), distance_column] = distance.euclidean(df.loc[id1, val_column], df.loc[id2, val_column])
            current_combination += 1
(I excluded the main() function, which just parses the command-line args and loads the pickled files based on them.)
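For reference, here is a minimal sketch of the kind of vectorized approach I'm wondering about, using scipy.spatial.distance.pdist. It assumes every array in the val column has the same length, which holds for my data (the toy values and index labels here are made up):

```python
import numpy as np
import pandas as pd
from itertools import combinations
from scipy.spatial.distance import pdist

# Toy frame in the same shape as my real data: one ndarray per cell.
df = pd.DataFrame(
    {'val': [np.array([0.0, 0.0]), np.array([3.0, 4.0]), np.array([6.0, 8.0])]},
    index=[1, 2, 3],
)

# Stack the per-row arrays into a single (n_rows, n_features) matrix.
mat = np.stack(df['val'].to_numpy())

# pdist returns a condensed vector of all n*(n-1)/2 pairwise distances,
# in the same order as itertools.combinations over the rows.
condensed = pdist(mat, metric='euclidean')

# Pair each distance with its (id1, id2) row labels.
pairs = pd.MultiIndex.from_tuples(combinations(df.index, 2), names=['id1', 'id2'])
distances = pd.Series(condensed, index=pairs, name='distance')
```

I don't know whether this is the idiomatic way to go from the condensed output back to labelled pairs, but it avoids both the Python-level double loop and the repeated `DataFrame.loc` writes.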
I've only recently started working with Python for this task, so there's every chance I'm missing something simple. Is there a good way to deal with this?