I'm trying to apply a variation of the pairwise Euclidean distance calculation to a pandas DataFrame to generate an edge list.
A standard Euclidean distance calculation can be done like this:
from scipy.spatial.distance import pdist
from scipy.spatial.distance import squareform
import pandas as pd
import numpy as np
# after dataframe is loaded
d_array = pdist(df, 'euclidean')
d_df = pd.DataFrame(squareform(d_array), index=df.index, columns= df.index)
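# keep only the upper triangle (diagonal included); np.bool was removed in NumPy 1.24, so use the builtin bool
d_df = d_df.where(np.triu(np.ones(d_df.shape)).astype(bool))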
edge_list = d_df.stack()
I then wrote a variation of the Euclidean distance calculation:
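from math import sqrt
import itertools
def arc_sub(a, b):
    # shortest difference between two values on a circle of circumference 2 * HALF_CIRCUM
    HALF_CIRCUM = 180
    l = max(a, b)
    s = min(a, b)
    if l - s > HALF_CIRCUM:
        return s + HALF_CIRCUM * 2 - l
    else:
        return l - s
def arc_dist(df, pair):
    # select the two rows of the pair; list(pair) avoids the tuple being read as a single label
    df_pair = df.loc[list(pair), :]
    x = df_pair.loc[pair[0], :]
    y = df_pair.loc[pair[1], :]
    return sqrt(sum(pow(arc_sub(a, b), 2) for a, b in zip(x, y)))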
pairs = list(itertools.combinations(list(df.index), 2))
edge_list = pd.DataFrame([arc_dist(df, pair) for pair in pairs], index=pairs)
It seems that my variation is much slower than pdist of scipy, and I guess it is due to the looping through the pair list. It takes less memory than pdist, and I guess pdist applies the calculation to all pairs at once. Is there any way I can apply my distance calculation to all pairs, too?
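To make the question concrete, here is a rough sketch of the two directions I have in mind, not a tested solution. It assumes every column holds numeric angles in degrees (which is what HALF_CIRCUM = 180 implies). The names arc_metric and arc_edge_list are hypothetical helpers of mine. The first relies on pdist accepting a callable metric; the second computes all pairs at once with NumPy, using np.triu_indices to generate the pairs in the same order as itertools.combinations.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist

HALF_CIRCUM = 180  # same constant as in arc_sub (assumes angles in degrees)


def arc_metric(x, y):
    """Wrap-around distance between two 1-D arrays; usable as a pdist metric."""
    diff = np.abs(x - y)
    # differences larger than half the circle go the short way round
    diff = np.where(diff > HALF_CIRCUM, 2 * HALF_CIRCUM - diff, diff)
    return np.sqrt(np.sum(diff ** 2))


def arc_edge_list(df):
    """Hypothetical helper: arc distance for all row pairs, indexed by pair."""
    values = df.to_numpy(dtype=float)
    # row positions of every pair (i < j), in itertools.combinations order
    i, j = np.triu_indices(len(df), k=1)
    diff = np.abs(values[i] - values[j])  # shape: (n_pairs, n_columns)
    diff = np.where(diff > HALF_CIRCUM, 2 * HALF_CIRCUM - diff, diff)
    dist = np.sqrt((diff ** 2).sum(axis=1))
    index = pd.MultiIndex.from_arrays([df.index[i], df.index[j]])
    return pd.Series(dist, index=index)
```

Usage would be either pdist(df.to_numpy(), arc_metric), which still calls the Python function once per pair but avoids the .loc lookups, or arc_edge_list(df), which avoids the Python loop but builds an intermediate array of size n_pairs by n_columns and so trades memory for speed.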