I'm calculating the Euclidean distance between all rows in a large data frame.

This code works:

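import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Pairwise Euclidean distances between the rows of df
distances = pdist(df, metric='euclidean')
# Expand the distances into a square matrix
dist_matrix = squareform(distances)
# Write the matrix to disk
pd.DataFrame(dist_matrix).to_csv('distance_matrix.txt')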

And this writes out a matrix like this:

    0     1     2
0 0.0   4.7   2.3
1 4.7   0.0   3.3
2 2.3   3.3   0.0

But there's a lot of redundant calculation happening (e.g. the distance between sequence 1 and sequence 2 gets a score, and then the distance between sequence 2 and sequence 1 gets the same score).

Does anyone know a more efficient way to calculate the Euclidean distance between the rows of a big data frame non-redundantly? The data frame is about 35 GB.

Slowat_Kela
  • According to [this answer](https://stackoverflow.com/a/13079806/9274732), `pdist` seems to calculate only the distance 1-2 and not 2-1. The value for the distance 2-1 is then "created" by `squareform`, not calculated. Was that your question, or did I misunderstand? Also, passing `df.to_numpy()` to `pdist` might be faster. – Ben.T Feb 07 '22 at 15:42
  • @Ben.T is correct; `pdist` is not doing redundant calculations. Take a look at `distances` before you pass it to `squareform`. It is a 1-d array of the non-redundant distance calculations (as explained in the linked answer in the above comment). `squareform` copies the values necessary to make the symmetric distance matrix, but it does not recompute any distances. – Warren Weckesser Feb 07 '22 at 17:21
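
To make the point in the comments concrete, here is a minimal sketch on a tiny made-up data frame (the values and the helper name `condensed_index` are assumptions for illustration, not part of the question). It checks that `pdist` returns one value per unordered pair of rows, and shows how a single distance might be looked up in that condensed array without calling `squareform` at all. The index formula is the one given in the SciPy documentation for `pdist`.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Hypothetical toy data standing in for the real data frame
df = pd.DataFrame([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0], [0.0, 1.0]])
n = len(df)

# pdist computes each unordered pair once: n * (n - 1) / 2 values
condensed = pdist(df.to_numpy(), metric='euclidean')
assert condensed.shape == (n * (n - 1) // 2,)


def condensed_index(i, j, n):
    """Position of the pair (i, j), i != j, in pdist's condensed output."""
    if i > j:
        i, j = j, i  # the pair is stored once, with the smaller index first
    return n * i + j - ((i + 2) * (i + 1)) // 2


# Distance between rows 1 and 2, read without building the square matrix
d_12 = condensed[condensed_index(1, 2, n)]

# squareform only mirrors the stored values into a symmetric matrix
square = squareform(condensed)
assert np.allclose(square, square.T)
assert np.isclose(square[1, 2], d_12)
```

The square matrix holds n² values while the condensed array holds n(n-1)/2, so if memory is the concern, keeping `distances` in condensed form and indexing into it roughly halves the storage compared with the `squareform` result.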
