I have a pandas data frame like the one below, where the index is a pd.DatetimeIndex and each column is a time series.

x_1 x_2 x_3
2020-08-17 133.23 2457.45 -4676
2020-08-18 -982 -6354.56 -245.657
2020-08-19 5678.642 245.2786 2461.785
2020-08-20 -2394 154.34 -735.653
2020-08-20 236 -8876 -698.245

I need to calculate the Euclidean distance between every pair of columns, i.e. x_1 vs x_2, x_1 vs x_3, and x_2 vs x_3, and return a square data frame like the one below. (The values in this table are only placeholders, not actual Euclidean distances.)

x_1 x_2 x_3
x_1 0 123 456
x_2 123 0 789
x_3 456 789 0

I tried this resource, but I could not figure out how to pass the columns of my df. If I understand correctly, the example treats the rows as the series to calculate the Euclidean distance from.

Riley
rbrt
  • The fact you have a DatetimeIndex and the columns are timeseries seems irrelevant. You essentially have 3 points, in n-dimensional space (where n is the number of rows) and you want to calculate the euclidean distance, right? – Riley Oct 24 '21 at 23:56
  • Euclidean distance. I know I can do something like np.linalg.norm(x_1 - x_2). But I want to calculate all columns at the same time. The output should look like the second dataframe, although the numbers are just to illustrate how the df should be filled. – rbrt Oct 24 '21 at 23:56
  • @Riley yes I want the Euclidean distance of all the columns not rows. – rbrt Oct 24 '21 at 23:57

2 Answers

1

An explicit way of achieving this would be:

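from itertools import combinations

import numpy as np
import pandas as pd

# Square frame of floats, initially all NaN, labelled by the column names
dist_df = pd.DataFrame(index=df.columns, columns=df.columns, dtype=float)

# combinations yields each unordered pair of distinct columns exactly once
for col_a, col_b in combinations(df.columns, 2):
    dist = np.linalg.norm(df[col_a] - df[col_b])
    # The distance is symmetric, so fill both triangles
    dist_df.loc[col_a, col_b] = dist
    dist_df.loc[col_b, col_a] = dist

print(dist_df)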

outputs

              x_1           x_2           x_3
x_1           NaN  12381.858429   6135.306973
x_2  12381.858429           NaN  12680.121047
x_3   6135.306973  12680.121047           NaN

If you want 0 instead of NaN use DataFrame.fillna:

dist_df.fillna(0, inplace=True)
DeepSpace
  • This code works but has a bug and I used itertools product instead of combinations. The reason is that combinations only makes pairs of distinct columns. And ED formula also computes cases like x_1 vs x_1. https://stackoverflow.com/questions/23833780/how-to-use-itertools-to-compute-all-combinations-with-repeating-elements – rbrt Oct 27 '21 at 21:25
  • @rbrt Not sure I see the bug. This code generates NaNs along the diagonal and later changes them to 0 (which is the expected distance) – DeepSpace Oct 28 '21 at 07:14
  • Yes, but by using itertools combinations it never computes x_1 vs x_1. I guess that is why it returns NaN. In fact I used the same logic to calculate Cosine Similarity in the same dataset, which expects 1s in the diagonal and it returns NaN. That's how I noticed something was off, and found that replacing "combinations" with "product" solved the issue. – rbrt Oct 29 '21 at 13:21
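
The difference discussed in these comments can be sketched as follows. This is a minimal illustration, not the answerer's code: the helper names `pairwise_matrix`, `euclidean`, and `cosine_similarity` are hypothetical and introduced here, and the data frame is rebuilt from the question's table without its index. `itertools.product` with `repeat=2` visits every ordered pair, including each column against itself, so the diagonal is computed by the metric instead of being left as NaN.

```python
from itertools import product

import numpy as np
import pandas as pd

# Values from the question's table (the index does not affect the result).
df = pd.DataFrame(
    {
        "x_1": [133.23, -982, 5678.642, -2394, 236],
        "x_2": [2457.45, -6354.56, 245.2786, 154.34, -8876],
        "x_3": [-4676, -245.657, 2461.785, -735.653, -698.245],
    }
)


def pairwise_matrix(frame, metric):
    """Apply `metric` to every ordered pair of columns, diagonal included."""
    out = pd.DataFrame(index=frame.columns, columns=frame.columns, dtype=float)
    for col_a, col_b in product(frame.columns, repeat=2):
        out.loc[col_a, col_b] = metric(
            frame[col_a].to_numpy(), frame[col_b].to_numpy()
        )
    return out


def euclidean(a, b):
    return np.linalg.norm(a - b)


def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


dist_df = pairwise_matrix(df, euclidean)         # diagonal comes out as 0
cos_df = pairwise_matrix(df, cosine_similarity)  # diagonal comes out as 1

print(dist_df)
print(cos_df)
```

For Euclidean distance alone, `combinations` followed by `fillna(0)` gives the same matrix with roughly half the metric calls; `product` is the simpler choice when the diagonal is not zero, as with cosine similarity.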
1

The following code works with any number of columns.

setup

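import pandas as pd

df = pd.DataFrame(
    {
        "x1": [133.23, -982, 5678.642, -2394, 236],
        "x2": [2457.45, -6354.56, 245.2786, 154.34, -8876],
        # The question's table lists -698.245 as the last value;
        # the output shown further down corresponds to 698.245 as written here
        "x3": [-4676, -245.657, 2461.785, -735.653, 698.245],
    }
)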

solution

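import numpy as np

# Stack n_cols read-only views of the data: shape (n_cols, n_rows, n_cols)
aux = np.broadcast_to(df.values, (df.shape[1], *df.shape))
# transpose() reverses the axes, so (aux - aux.transpose())[i, r, j] equals
# df.iloc[r, j] - df.iloc[r, i]; summing the squares over rows (axis=1)
# gives the squared distance between columns i and j
result = np.sqrt(np.square(aux - aux.transpose()).sum(axis=1))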

result is a numpy.ndarray.

If you wish, you can wrap it in a data frame like this:

pd.DataFrame(result, columns=df.columns, index=df.columns)

              x1            x2            x3
x1      0.000000  12381.858429   6081.352512
x2  12381.858429      0.000000  13622.626775
x3   6081.352512  13622.626775      0.000000

Why this approach works is beyond what I'm willing to go into and requires a strong math background. You will need to decide what is more important for you: speed, or readability/understandability.

Riley
  • *"Why this approach works is beyond what I'm willing to go into and requires a strong math background"* Huh? It almost literally contains the Euclidean distance formula. I'd remove this "fluff" from the answer, especially considering the fact OP already knows about `np.linalg.norm` – DeepSpace Oct 25 '21 at 00:21
  • It involves broadcasting the matrices and calculating the Euclidean distance between vectors using 3-dimensional arrays. It's not trivial. – Riley Oct 25 '21 at 00:26
  • This code works! However, I decided not to use it since it is beyond my skills (although I read about broadcast) and I wanted to be able to understand and explain the code. – rbrt Oct 27 '21 at 21:36
  • You can also check out scipy distance metrics https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html#scipy.spatial.distance.pdist – Riley Oct 27 '21 at 21:45
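
The broadcasting idea and the `pdist` suggestion from these comments can be compared side by side. The following is a minimal sketch, assuming SciPy is installed and reusing the values from this answer's setup; the variable names (`points`, `diff`, `via_broadcast`, `via_scipy`) are chosen here for illustration and do not come from the answer. Transposing the data turns each column into one point, which is the row-per-observation layout `pdist` expects.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Same values as the setup in the answer above.
df = pd.DataFrame(
    {
        "x1": [133.23, -982, 5678.642, -2394, 236],
        "x2": [2457.45, -6354.56, 245.2786, 154.34, -8876],
        "x3": [-4676, -245.657, 2461.785, -735.653, 698.245],
    }
)

# Each column becomes one point: shape (n_columns, n_rows).
points = df.to_numpy().T

# Explicit broadcasting: (n, 1, m) - (1, n, m) -> (n, n, m) pairwise differences.
diff = points[:, np.newaxis, :] - points[np.newaxis, :, :]
via_broadcast = np.sqrt((diff ** 2).sum(axis=-1))

# pdist returns the condensed upper triangle; squareform expands it
# into the full symmetric matrix with zeros on the diagonal.
via_scipy = squareform(pdist(points, metric="euclidean"))

result = pd.DataFrame(via_scipy, index=df.columns, columns=df.columns)
print(result)
```

Both routes should agree to floating-point precision. The broadcast version materialises an (n, n, m) array, so for many long columns `pdist` is likely the more memory-friendly choice.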