scipy pdist variation performance boosting by applying to all pairs

Question

I'm trying to apply a variation of the pairwise Euclidean distance calculation to a pandas DataFrame to generate an edge list.

A standard euclidean distance calculation can be:

from scipy.spatial.distance import pdist
from scipy.spatial.distance import squareform
import pandas as pd
import numpy as np
# after dataframe is loaded
d_array = pdist(df, 'euclidean')
d_df = pd.DataFrame(squareform(d_array), index=df.index, columns= df.index)
# keep the upper triangle only; np.bool was removed in NumPy 1.24, so use the built-in bool
d_df = d_df.where(np.triu(np.ones(d_df.shape)).astype(bool))
edge_list = d_df.stack()

I then wrote a variation of the Euclidean distance calculation, in which per-coordinate differences wrap around a 360-degree circle:

from math import sqrt
import itertools

def arc_sub(a, b):
    HALF_CIRCUM = 180
    l = max(a, b)
    s = min(a, b)
    if l - s > HALF_CIRCUM:
        return s + HALF_CIRCUM * 2 - l
    else:
        return l - s

def arc_dist(df, pair):
    # look up the two rows of the pair by their index labels
    x = df.loc[pair[0], :]
    y = df.loc[pair[1], :]
    # Euclidean norm of the wrapped per-coordinate differences
    return sqrt(sum(arc_sub(a, b) ** 2 for a, b in zip(x, y)))

pairs = list(itertools.combinations(list(df.index), 2))
edge_list = pd.DataFrame([arc_dist(df, pair) for pair in pairs], index=pairs)

My variation seems to be much slower than SciPy's pdist, and I guess this is because of the Python-level loop over the list of pairs. It does use less memory than pdist, which I assume applies the calculation to all pairs at once. Is there any way to apply my distance calculation to all pairs at once, too?

can you provide small (3-5 rows) but reproducible sample data set and desired resulting data set? — MaxU - stand with Ukraine, Mar 18 '17 at 14:52
[Here is an example of using `pdist` with custom function](http://stackoverflow.com/a/42881101/5741205). BTW please read [how to make good reproducible pandas examples](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) — MaxU - stand with Ukraine, Mar 19 '17 at 08:52
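
For illustration, here is a minimal sketch of two ways the wrapped distance from the question might be applied to all pairs. The sample DataFrame and the names `arc_metric` and `arc_pdist` are hypothetical and not from the original post. The first variant passes a callable to `pdist`, as the comment above suggests; SciPy still calls it once per pair from Python, so the gain may be modest. The second variant uses NumPy broadcasting over all pairs at once, which is typically faster but allocates an intermediate array with one row per pair.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist

HALF_CIRCUM = 180  # same constant as in arc_sub


def arc_metric(u, v):
    """Wrapped distance between two rows; usable as a pdist callable."""
    diff = np.abs(u - v)
    # A difference above half the circle is replaced by the shorter arc.
    diff = np.where(diff > HALF_CIRCUM, 2 * HALF_CIRCUM - diff, diff)
    return np.sqrt(np.sum(diff ** 2))


def arc_pdist(values):
    """Condensed distance vector for all pairs, in the same order as pdist."""
    values = np.asarray(values, dtype=float)
    # Row indices of every pair (i < j), in itertools.combinations order.
    i, j = np.triu_indices(len(values), k=1)
    diff = np.abs(values[i] - values[j])  # shape: (n_pairs, n_columns)
    diff = np.where(diff > HALF_CIRCUM, 2 * HALF_CIRCUM - diff, diff)
    return np.sqrt((diff ** 2).sum(axis=1))


# Hypothetical sample data: angles in degrees.
df = pd.DataFrame(
    {"a": [10.0, 350.0, 180.0, 90.0], "b": [0.0, 20.0, 270.0, 359.0]},
    index=list("wxyz"),
)

# Variant 1: pdist with a custom callable (one Python call per pair).
d_callable = pdist(df.to_numpy(), arc_metric)

# Variant 2: fully vectorized over all pairs.
d_vector = arc_pdist(df.to_numpy())

# Build the edge list, labelled by the index values of each pair.
i, j = np.triu_indices(len(df), k=1)
edge_list = pd.Series(
    d_vector,
    index=pd.MultiIndex.from_arrays([df.index[i], df.index[j]]),
)
```

Both variants assume every column holds values on the same 360-degree scale, as `arc_sub` does. If memory becomes a problem with the vectorized version, the pair indices could be processed in chunks.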
