8

How can I calculate the Euclidean distance between all the rows of a dataframe? I am trying this code, but it is not working:

zero_data = data
distance = lambda column1, column2: pd.np.linalg.norm(column1 - column2)
result = zero_data.apply(lambda col1: zero_data.apply(lambda col2: distance(col1, col2)))
result.head()

This is what my (44062 by 278) dataframe looks like:

Please see sample data here

Quicklearner.gk
  • Can you try `zero_data.apply(lambda row: distance(*row.values), axis=1)` – DOOM Mar 07 '20 at 06:56
  • `TypeError: ('<lambda>() takes 2 positional arguments but 44062 were given', 'occurred at index 0')` is the error I am getting – Quicklearner.gk Mar 07 '20 at 07:01
  • It's for the Euclidean distance – Quicklearner.gk Mar 07 '20 at 07:13
  • Please read [this](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) article on how to post a reproducible `pandas` question. You have not provided any samples of data for us to work with. – Ukrainian-serge Mar 07 '20 at 07:37
  • @Quicklearner which column names in your dataframe are you using in your computation? – DOOM Mar 07 '20 at 07:44
  • It’s 8,8 and 6,7 and 7,7 and so on based on bigram – Quicklearner.gk Mar 07 '20 at 07:49
  • So between which columns do you want to compute the distance? Between all of them? – Andreas K. Mar 07 '20 at 07:58
  • I have this dataframe of 44062 by 278 and I have to find the Euclidean distance between users; Actual_Data is my user column – Quicklearner.gk Mar 07 '20 at 08:09
  • If you have 3 users a,b,c then you want to find 3 distances between each of them, a-b, a-c, b-c? – Andreas K. Mar 07 '20 at 08:11
  • `zero_data = data; distance = lambda column1, column2: pd.np.linalg.norm(column1 - column2); result = zero_data.apply(lambda col1: zero_data.apply(lambda col2: distance(col1, col2))); result.head()` This gives results based on the columns of my dataframe, but I want it row-wise – Quicklearner.gk Mar 07 '20 at 08:14
  • @Quicklearner so if I understand, user **0** is the row with `index = 0`, user **1** is the row with `index = 1`, and so on, and you want to calculate the `np.linalg.norm` between all pairs: **user 0** and **user 1**, **user 0** and **user 2**, ... **user 0** and **user n**, etc. – DOOM Mar 07 '20 at 08:22
  • Yes right you are correct – Quicklearner.gk Mar 07 '20 at 08:24

2 Answers

9

To compute the Euclidean distance between two rows i and j of a dataframe df:

import numpy as np

np.linalg.norm(df.loc[i] - df.loc[j])

To compute it between consecutive rows, i.e. 0 and 1, 1 and 2, 2 and 3, ...

np.linalg.norm(df.diff(axis=0).drop(0), axis=1)

If you want to compute it between all the rows, i.e. 0 and 1, 0 and 2, ..., 1 and 2, 1 and 3, ..., then you have to loop through all the combinations of i and j (keep in mind that for 44062 rows there are 970707891 such combinations, so using a for-loop will be very slow):

import itertools

for i, j in itertools.combinations(df.index, 2):
    # d_ij is the Euclidean distance between rows i and j; store or use it here
    d_ij = np.linalg.norm(df.loc[i] - df.loc[j])

Edit:

Instead, you can use scipy.spatial.distance.cdist which computes distance between each pair of two collections of inputs:

from scipy.spatial.distance import cdist

cdist(df, df, 'euclid')

This will return a symmetric (44062 by 44062) matrix of Euclidean distances between all the rows of your dataframe. The problem is that you need a lot of memory for it to work (at least 8*44062**2 bytes, i.e. ~16GB). So a better option is to use pdist

from scipy.spatial.distance import pdist

pdist(df.values, 'euclid')

which will return an array (of size 970707891) of all the pairwise Euclidean distances between the rows of df.
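
If you only need the distances for a handful of specific pairs, you don't have to build the full matrix at all: the position of a pair (i, j) (positional indices, with i < j) inside the condensed array follows pdist's ordering and can be computed directly. A minimal sketch (the `condensed_index` helper is just for illustration, not a scipy function):

import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist

def condensed_index(n, i, j):
    # Position of the pair (i, j), with i < j, in the condensed
    # distance array returned by pdist for n observations
    return n * i - i * (i + 1) // 2 + (j - i - 1)

# Small illustration with random data standing in for your dataframe
sample = pd.DataFrame(np.random.rand(6, 4))
dists = pdist(sample.values, 'euclidean')

i, j = 1, 4
d_ij = dists[condensed_index(len(sample), i, j)]
assert np.isclose(d_ij, np.linalg.norm(sample.loc[i] - sample.loc[j]))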

P.S. Don't forget to exclude the 'Actual_Data' column from the distance computations. E.g. you can do the following: `data = df.drop('Actual_Data', axis=1).values` and then `cdist(data, data, 'euclid')` or `pdist(data, 'euclid')`. You can also create another dataframe with distances like this:

data = df.drop('Actual_Data', axis=1).values

d = pd.DataFrame(itertools.combinations(df.index, 2), columns=['i','j'])
d['dist'] = pdist(data, 'euclid')


   i  j  dist
0  0  1  ...
1  0  2  ...
2  0  3  ...
3  0  4  ...
...
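
From this dataframe you can then look up the distance for a specific pair of users, say rows 0 and 3 (just a usage sketch of the `d` built above):

# Distance between users (rows) 0 and 3
d.loc[(d['i'] == 0) & (d['j'] == 3), 'dist']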
Andreas K.
2

Working with a subset of your data, for example:

import numpy as np
import pandas as pd

df_data = [[888888, 3, 0, 0],
           [677767, 0, 2, 1],
           [212341212, 0, 0, 0],
           [141414141414, 0, 0, 0],
           [1112224, 0, 0, 0]]

# Creating the dataframe
df = pd.DataFrame(data=df_data, columns=['Actual_Data', '8,8', '6,6', '7,7'], dtype=np.float64)

# Which looks like
#     Actual_Data  8,8  6,6  7,7
# 0  8.888880e+05  3.0  0.0  0.0
# 1  6.777670e+05  0.0  2.0  1.0
# 2  2.123412e+08  0.0  0.0  0.0
# 3  1.414141e+11  0.0  0.0  0.0
# 4  1.112224e+06  0.0  0.0  0.0

# Computing the distance matrix
dist_matrix = df.apply(lambda row: [np.linalg.norm(row.values - df.loc[[_id], :].values, 2) for _id in df.index.values], axis=1)

# Which looks like
# 0     [0.0, 211121.00003315636, 211452324.0, 141413252526.0, 223336.000020149]
# 1    [211121.00003315636, 0.0, 211663445.0, 141413463647.0, 434457.0000057543]
# 2                 [211452324.0, 211663445.0, 0.0, 141201800202.0, 211228988.0]
# 3        [141413252526.0, 141413463647.0, 141201800202.0, 0.0, 141413029190.0]
# 4      [223336.000020149, 434457.0000057543, 211228988.0, 141413029190.0, 0.0]

# Reformatting the above into readable format
dist_matrix = pd.DataFrame(
  data=dist_matrix.values.tolist(), 
  columns=df.index.tolist(), 
  index=df.index.tolist())

# Which gives you
#               0             1             2             3             4
# 0  0.000000e+00  2.111210e+05  2.114523e+08  1.414133e+11  2.233360e+05
# 1  2.111210e+05  0.000000e+00  2.116634e+08  1.414135e+11  4.344570e+05
# 2  2.114523e+08  2.116634e+08  0.000000e+00  1.412018e+11  2.112290e+08
# 3  1.414133e+11  1.414135e+11  1.412018e+11  0.000000e+00  1.414130e+11
# 4  2.233360e+05  4.344570e+05  2.112290e+08  1.414130e+11  0.000000e+00
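
For a sample this small the same matrix can also be obtained directly with scipy's cdist, as in the other answer (a sketch; at the full 44062-row scale this is exactly what runs out of memory):

from scipy.spatial.distance import cdist

# Equivalent distance matrix for the small sample above
dist_matrix_scipy = pd.DataFrame(
    cdist(df.values, df.values, 'euclidean'),
    index=df.index, columns=df.index)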

Update

As pointed out in the comments, the issue is memory overflow, so we have to process the data in batches.

# Collecting the data
# df = ....

# Number of chunks the rows are split into; increase it if you still
# get `memory` errors (more chunks means fewer rows per chunk).
batch = 200

# To be conservative, let's write the intermediate results to files.
dffname = []

for ifile, _slice in enumerate(np.array_split(range(df.shape[0]), batch)):

  # Compute distances between the rows in this chunk and all rows of the dataframe
  tmp_df = df.iloc[_slice, :].apply(lambda row: [np.linalg.norm(row.values - df.loc[[_id], :].values, 2) for _id in df.index.values], axis=1)

  tmp_df = pd.DataFrame(tmp_df.values.tolist(), index=df.index.values[_slice], columns=df.index.values)

  # You can change the output format from csv to any other file type
  tmp_df.to_csv(f"{ifile+1}.csv")
  dffname.append(f"{ifile+1}.csv")

# Reading back the dataFrames
dflist = []
for f in dffname:
  dflist.append(pd.read_csv(f, dtype=np.float64, index_col=0))

res = pd.concat(dflist)
DOOM
  • I think the column 'Actual_data' should be ignored in the computations. Also you should take into account that there are 44062 rows in the df, and your solution will be very slow. – Andreas K. Mar 07 '20 at 13:58
  • After applying this I am getting a `MemoryError` from `cdist(a, a, 'euclid')`; the traceback ends at `dm = np.empty((mA, mB), dtype=np.double)` in `scipy\spatial\distance.py` (inside `cdist`). – Quicklearner.gk Mar 07 '20 at 15:27
  • @Quicklearner you are out of memory, as I wrote in my answer... a 44062 by 44062 array of floats requires at least around 16GB of memory. Use pdist instead which requires half the memory. – Andreas K. Mar 07 '20 at 16:34