
I have a pandas dataframe that looks as follows:

In [23]: dataframe.head()
Out[23]: 
column_id   1  10  11  12  13  14  15  16  17  18 ...  46  47  48  49   5  50  \
row_id                                            ...                           
1         NaN NaN   1   1   1   1   1   1   1   1 ...   1   1 NaN   1 NaN NaN   
10          1   1   1   1   1   1   1   1   1 NaN ...   1   1   1 NaN   1 NaN   
100         1   1 NaN   1   1   1   1   1 NaN   1 ... NaN NaN   1   1   1 NaN   
11        NaN   1   1   1   1   1   1   1   1 NaN ... NaN   1   1   1   1   1   
12          1   1   1 NaN   1   1   1   1 NaN   1 ...   1 NaN   1   1 NaN   1   

The thing is, I'm currently using the Pearson correlation to calculate similarity between rows, and given the nature of the data, the standard deviation is sometimes zero (all values are 1 or NaN), so the Pearson correlation returns this:

In [24]: dataframe.transpose().corr().head()
Out[24]: 
row_id   1  10  100  11  12  13  14  15  16  17 ...  90  91  92  93  94  95  \
row_id                                          ...                           
1      NaN NaN  NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN   
10     NaN NaN  NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN   
100    NaN NaN  NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN   
11     NaN NaN  NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN   
12     NaN NaN  NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN    

Is there any other way of computing correlations that avoids this? Maybe an easy way to calculate the Euclidean distance between rows with a single method call, just as the Pearson correlation has?

Thanks!

A.

misterte
  • Just change the NaNs to zeros? I'm not sure what that would mean or what you're trying to do in the first place, but that would be some sort of correlation measure I suppose. – JohnE Apr 18 '15 at 22:43
  • I assume you meant dataframe.fillna(0), not .corr().fillna(0). The thing is that this won't work properly with similarities/recommendations right out of the box. Maybe I can use that in combination with some boolean mask. Thanks for the suggestion. – misterte Apr 18 '15 at 22:50
  • Yeah, that's right. I don't even know what it would mean to have correlation/distance/whatever when you only have one possible non-NaN value. At least all ones and zeros has a well-defined meaning. I mean, your #1 issue here is what does it even mean to have a matrix of ones and NaNs? Whether you want a correlation or distance is issue #2. – JohnE Apr 18 '15 at 22:58
  • Tried it and it really messes things up. Now if you get two rows with 1 match they will have len(cols)-1 mismatches, instead of only differing in non-NaN values. Results are way different. Thanks anyway. – misterte Apr 18 '15 at 23:00
  • No worries. You may want to post a smaller but complete sample dataset (like 5x3) and an example of the results you are looking for. I still can't guess what you are looking for, other than maybe a count of matches, but I'm not sure exactly how you count a match vs non-match. – JohnE Apr 18 '15 at 23:05

4 Answers


The key question here is what distance metric to use.

Let's say this is your data.

>>> import numpy as np
>>> import pandas as pd
>>> data = pd.DataFrame(np.random.rand(100, 50))
>>> data[data > 0.2] = 1
>>> data[data <= 0.2] = np.nan
>>> data.head()
   0   1   2   3   4   5   6   7   8   9  ...  40  41  42  43  44  45  46  47  \
0   1   1   1 NaN   1 NaN NaN   1   1   1 ...   1   1 NaN   1 NaN   1   1   1
1   1   1   1 NaN   1   1   1   1   1   1 ... NaN   1   1 NaN NaN   1   1   1
2   1   1   1   1   1   1   1   1   1   1 ...   1 NaN   1   1   1   1   1 NaN
3   1 NaN   1 NaN   1 NaN   1 NaN   1   1 ...   1   1   1   1 NaN   1   1   1
4   1   1   1   1   1   1   1   1 NaN   1 ... NaN   1   1   1   1   1   1   1

What is the % difference?

You can compute a distance metric as the percentage of values that differ between each pair of columns. The result shows the % difference between any two columns.

>>> zero_data = data.fillna(0)
>>> distance = lambda column1, column2: (column1 - column2).abs().sum() / len(column1)
>>> result = zero_data.apply(lambda col1: zero_data.apply(lambda col2: distance(col1, col2)))
>>> result.head()
     0     1     2     3     4     5     6     7     8     9   ...     40  \
0  0.00  0.36  0.33  0.37  0.32  0.41  0.35  0.33  0.39  0.33  ...   0.37
1  0.36  0.00  0.37  0.29  0.30  0.37  0.33  0.37  0.33  0.31  ...   0.35
2  0.33  0.37  0.00  0.36  0.29  0.38  0.40  0.34  0.30  0.28  ...   0.28
3  0.37  0.29  0.36  0.00  0.29  0.30  0.34  0.26  0.32  0.36  ...   0.36
4  0.32  0.30  0.29  0.29  0.00  0.31  0.35  0.29  0.29  0.25  ...   0.27
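
A NaN-vs-NaN pair arguably should not count as a difference here (the concern raised in the comments above). One way to handle that, sketched below as my own addition rather than part of the original answer, is to compare the NaN masks directly and normalise by the number of positions where at least one column has a value; this relies on the assumption that the only non-missing value is 1, as in the example data.

# Sketch: a mismatch is "exactly one of the two columns has a value";
# divide by the number of positions where at least one value is present.
nan_aware_diff = lambda column1, column2: (
    (column1.notna() ^ column2.notna()).sum()
    / (column1.notna() | column2.notna()).sum()
)
result = data.apply(lambda col1: data.apply(lambda col2: nan_aware_diff(col1, col2)))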

What is the correlation coefficient?

Here, we use the Pearson correlation coefficient. This is a perfectly valid metric. Specifically, it translates to the phi coefficient in the case of binary data.

>>> import scipy.stats
>>> zero_data = data.fillna(0)
>>> distance = lambda column1, column2: scipy.stats.pearsonr(column1, column2)[0]
>>> result = zero_data.apply(lambda col1: zero_data.apply(lambda col2: distance(col1, col2)))
>>> result.head()
         0         1         2         3         4         5         6   \
0  1.000000  0.013158  0.026262 -0.059786 -0.024293 -0.078056  0.054074
1  0.013158  1.000000 -0.093109  0.170159  0.043187  0.027425  0.108148
2  0.026262 -0.093109  1.000000 -0.124540 -0.048485 -0.064881 -0.161887
3 -0.059786  0.170159 -0.124540  1.000000  0.004245  0.184153  0.042524
4 -0.024293  0.043187 -0.048485  0.004245  1.000000  0.079196 -0.099834

Incidentally, this is the same result you would get with the Spearman rank correlation coefficient.
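
As a quick sanity check of that claim (my own sketch, not from the original answer): for 0/1 data the rank transform only relabels the two values, so Pearson and Spearman agree on any pair of columns.

import scipy.stats

col_a, col_b = zero_data[0], zero_data[1]
print(scipy.stats.pearsonr(col_a, col_b)[0])
print(scipy.stats.spearmanr(col_a, col_b)[0])   # same value, up to floating-point rounding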

What is the Euclidean distance?

>>> zero_data = data.fillna(0)
>>> distance = lambda column1, column2: np.linalg.norm(column1 - column2)
>>> result = zero_data.apply(lambda col1: zero_data.apply(lambda col2: distance(col1, col2)))
>>> result.head()
         0         1         2         3         4         5         6   \
0  0.000000  6.000000  5.744563  6.082763  5.656854  6.403124  5.916080
1  6.000000  0.000000  6.082763  5.385165  5.477226  6.082763  5.744563
2  5.744563  6.082763  0.000000  6.000000  5.385165  6.164414  6.324555
3  6.082763  5.385165  6.000000  0.000000  5.385165  5.477226  5.830952
4  5.656854  5.477226  5.385165  5.385165  0.000000  5.567764  5.916080
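
If all you need is the Euclidean distance matrix itself, scipy can also produce it in a single vectorized call instead of the nested apply. This is my own sketch, not part of the original answer; pdist treats rows as observations, hence the transpose so that distances are taken between columns.

from scipy.spatial.distance import pdist, squareform

dist = pd.DataFrame(squareform(pdist(zero_data.T)),
                    index=zero_data.columns, columns=zero_data.columns)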

By now, you'd have a sense of the pattern. Create a distance method. Then apply it pairwise to every column using

data.apply(lambda col1: data.apply(lambda col2: method(col1, col2)))

If your distance method relies on the presence of zeroes instead of NaNs, convert to zeroes using .fillna(0).
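
That pattern is easy to wrap in a small helper so that different metrics can be swapped in. A sketch of my own (the name pairwise is hypothetical, not from the answer):

def pairwise(df, metric, fill=None):
    # Square DataFrame of metric(col_i, col_j) over all column pairs.
    if fill is not None:
        df = df.fillna(fill)
    return df.apply(lambda col1: df.apply(lambda col2: metric(col1, col2)))

# e.g. the Euclidean distance from above:
result = pairwise(data, lambda c1, c2: np.linalg.norm(c1 - c2), fill=0)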

S Anand
  • This is a very good answer and it definitely helps me with what I'm doing. Thanks for that. As a bonus, I still see different recommendation results when using fillna(0) with Pearson correlation. This is because in some cases it's not just NaNs and 1s, but other integers, which gives a std > 0. Do you know of any way to account for this? – misterte Apr 19 '15 at 22:35
  • zero_data = df.fillna(0); distance = lambda column1, column2: ((column1 == column2).astype(int).sum() / column1.sum()) / ((np.logical_not(column1) == column2).astype(int).sum() / np.logical_not(column1).sum()); result = zero_data.apply(lambda col1: zero_data.apply(lambda col2: distance(col1, col2))); result.head() – plotti Mar 10 '16 at 21:20
  • How to do the same for rows instead of columns? – Sigur Aug 29 '17 at 19:10
  • @Sigur - `data.T` transposes rows and columns. You could just replace `data` with `data.T`. – S Anand Aug 31 '17 at 02:54

A proposal to improve the excellent answer from @s-anand for Euclidean distance: instead of

import numpy as np

zero_data = data.fillna(0)
distance = lambda column1, column2: np.linalg.norm(column1 - column2)

we can apply fillna to fill only the missing values in the difference, thus:

distance = lambda column1, column2: np.linalg.norm((column1 - column2).fillna(0))

This way, the distance on missing dimensions will not be counted.
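
Plugged into the nested-apply pattern from the accepted answer, this looks roughly as follows (a sketch of my own; data is the example frame from that answer and distance is the lambda just defined):

result = data.apply(lambda col1: data.apply(lambda col2: distance(col1, col2)))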

maparent
  • This is not a good idea imho: the distance lambda is applied to every combination of columns, so the fillna inside it is executed far more often in the worst case than a single up-front fill. – Nico Albers Jan 10 '17 at 23:37
0

This is my numpy-only version of @S Anand's fantastic answer, which I put together in order to help myself understand his explanation better.

Happy to share it with a short, reproducible example:

# Preliminaries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Get iris dataset into a DataFrame
from sklearn.datasets import load_iris
iris = load_iris()
iris_df = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                       columns=iris['feature_names'] + ['target'])

Let's try scipy.stats.pearsonr first.

Executing:

from scipy.stats import pearsonr

distance = lambda column1, column2: pearsonr(column1, column2)[0]
rslt = iris_df.apply(lambda col1: iris_df.apply(lambda col2: distance(col1, col2)))
pd.options.display.float_format = '{:,.2f}'.format
rslt

returns a 5×5 DataFrame of pairwise Pearson correlations between the iris columns (shown as a screenshot in the original post),

and:

rslt_np = np.apply_along_axis(lambda col1: np.apply_along_axis(lambda col2: pearsonr(col1, col2)[0], 
                                                               axis = 0, arr=iris_df), 
                              axis =0, arr=iris_df)
float_formatter = lambda x: "%.2f" % x
np.set_printoptions(formatter={'float_kind':float_formatter})
rslt_np

returns:

array([[1.00, -0.12, 0.87, 0.82, 0.78],
       [-0.12, 1.00, -0.43, -0.37, -0.43],
       [0.87, -0.43, 1.00, 0.96, 0.95],
       [0.82, -0.37, 0.96, 1.00, 0.96],
       [0.78, -0.43, 0.95, 0.96, 1.00]])

As a second example, let's try the distance correlation from the dcor library.

Executing:

import dcor
dist_corr = lambda column1, column2: dcor.distance_correlation(column1, column2)
rslt = iris_df.apply(lambda col1: iris_df.apply(lambda col2: dist_corr(col1, col2)))
pd.options.display.float_format = '{:,.2f}'.format
rslt 

returns the corresponding 5×5 DataFrame of pairwise distance correlations (shown as a screenshot in the original post),

while:

rslt_np = np.apply_along_axis(lambda col1: np.apply_along_axis(lambda col2: dcor.distance_correlation(col1, col2), 
                                                               axis = 0, arr=iris_df), 
                              axis =0, arr=iris_df)
float_formatter = lambda x: "%.2f" % x
np.set_printoptions(formatter={'float_kind':float_formatter})
rslt_np

returns:

array([[1.00, 0.31, 0.86, 0.83, 0.78],
       [0.31, 1.00, 0.54, 0.51, 0.51],
       [0.86, 0.54, 1.00, 0.97, 0.95],
       [0.83, 0.51, 0.97, 1.00, 0.95],
       [0.78, 0.51, 0.95, 0.95, 1.00]])
MyCarta

I compared three variants from the other answers for speed, using a trial 1000×25 matrix (which yields a 1000×1000 result matrix).

  1. dcor library

Time: 0.03s

https://dcor.readthedocs.io/en/latest/functions/dcor.distances.pairwise_distances.html

import dcor

result = dcor.distances.pairwise_distances(data)
  2. scipy.spatial.distance_matrix

Time: 0.05s

https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance_matrix.html

from scipy.spatial import distance_matrix

result = distance_matrix(data, data)
  3. nested apply with a lambda (numpy or pandas)

Time: 180s / 90s

import numpy as np
import pandas as pd

distance = lambda x, y: np.sqrt(np.sum((x - y) ** 2))  # variant A (180s)
distance = lambda x, y: np.linalg.norm(x - y)           # variant B (90s; pd.np is deprecated, use np directly)

result = data.apply(lambda x: data.apply(lambda y: distance(x, y), axis=1), axis=1)
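
For reference, a minimal way to reproduce such a comparison (my own sketch; the original post does not show its benchmark code, and timings will vary by machine):

import time

import numpy as np
import pandas as pd
from scipy.spatial import distance_matrix

data = pd.DataFrame(np.random.rand(1000, 25))

t0 = time.perf_counter()
result_scipy = distance_matrix(data.values, data.values)
print("scipy distance_matrix:", round(time.perf_counter() - t0, 3), "s")

t0 = time.perf_counter()
distance = lambda x, y: np.linalg.norm(x - y)
result_apply = data.apply(lambda x: data.apply(lambda y: distance(x, y), axis=1), axis=1)
print("nested apply:", round(time.perf_counter() - t0, 3), "s")   # expect minutes rather than seconds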
Michal Skop