I'm calculating the Euclidean distance between all rows in a large data frame.
This code works:
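import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Condensed vector of pairwise Euclidean distances between rows
distances = pdist(df, metric='euclidean')
# Expand to a full symmetric n x n matrix
dist_matrix = squareform(distances)
pd.DataFrame(dist_matrix).to_csv('distance_matrix.txt')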
And this writes out a matrix like this:
     0    1    2
0  0.0  4.7  2.3
1  4.7  0.0  3.3
2  2.3  3.3  0.0
But there's a lot of redundant calculation happening: the distance between sequence 1 and sequence 2 gets a score, and then the distance between sequence 2 and sequence 1 gets the same score.
Does anyone know a more efficient, non-redundant way of calculating the Euclidean distance between the rows of a big data frame? The data frame is about 35 GB.
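For reference, as far as I can tell `pdist` itself already returns only one distance per unordered pair of rows (the condensed form), and the duplication only appears when `squareform` expands it into the full matrix. Below is a minimal sketch of keeping the condensed form and labelling each distance with its row pair. The toy three-row frame, the column names `row_i`/`row_j`/`distance`, and the output file name `distance_pairs.txt` are all made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist

# Toy stand-in for the real data frame (hypothetical values).
df = pd.DataFrame([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])

# pdist returns n*(n-1)/2 distances, one per pair (i, j) with i < j.
condensed = pdist(df.to_numpy(), metric='euclidean')

# Upper-triangle indices (excluding the diagonal), in the same
# row-major order that pdist uses for its condensed output.
i, j = np.triu_indices(len(df), k=1)

# Long-format table: one row per unordered pair, no mirrored duplicates.
pairs = pd.DataFrame({'row_i': i, 'row_j': j, 'distance': condensed})
pairs.to_csv('distance_pairs.txt', index=False)
```

This only avoids storing each pair twice. The number of pairs still grows quadratically with the number of rows, so it does not by itself solve the memory problem for a 35 GB input.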