how to compute correlation coefficient for multi-variable 1 column

Question

Category  SubCategory  Month  Value
A         A1           Jan     1
A         A1           Feb     2
A         A1           Mar     3
A         A2           Jan     2
A         A2           Feb     3
A         A2           Mar     5
B         B1           Jan     1
B         B1           Feb     6
B         B1           Mar     7
B         B2           Jan     3
B         B2           Feb     6
B         B2           Mar     7

I have a sample pandas df like this. I would like to compute the correlation coefficient between Subgroup Categories A1 and A2, B1 and B2, but not A1 and B1 etc. My end goal is to have a table such as this:

    A1        A2        B1     B2
A1  1.0000  0.9820      
A2  0.9820  1.0000      
B1                    1.0000    0.9963
B2                    0.9963    1.0000

Can anyone help me with python code?

Obviously this one gives me a corr value of 1 for each SubCategory

df.groupby('SubCategory').corr()

OP's data is not normalized, so it has the potential to be inconsistent. So solutions would rely on the fact that data is valid, e.g., if one of the "Feb" be replaced by a "Apr", the solution by *W-B* would break, while that by *indominus* would give a result without detecting the error (also potential dangerous) — Indominus, Dec 13 '18 at 06:26

score 3 · Answer 1 · answered Dec 13 '18 at 04:43

3

1st it is pivot problem , then just using corr

pd.concat([x.pivot('Month','SubCategory','Value').corr() for _,x in df.groupby('Category')])
                   A1        A2        B1        B2
SubCategory                                        
A1           1.000000  0.981981       NaN       NaN
A2           0.981981  1.000000       NaN       NaN
B1                NaN       NaN  1.000000  0.996271
B2                NaN       NaN  0.996271  1.000000

answered Dec 13 '18 at 04:43

BENY

317,841
20
164
234

may I know why Month is index? and x[month].values are not displayed in index – Pyd Dec 13 '18 at 04:58
thank you for the answer, it works, however for the real-time application, I have 4700x 4700 matrix with mostly empty/NAN values. Is there any way to group these (create new dfs by category?) – rookiescientist Dec 13 '18 at 05:07
Just skip the final `pd.concat`, and you'll get 2 separate dataframes, one for each category. – tel Dec 13 '18 at 05:55
@rookiescientist just do not do concat – BENY Dec 13 '18 at 06:39

Indominus · Answer 2 · 2018-12-14T00:23:31.047

Data

import pandas as pd
df = pd.DataFrame({"Category" :   ["A", "A", "A", "A", "A", "A", 
                                   "B", "B", "B", "B", "B", "B"], 
                   "SubCategory": ["A1", "A1", "A1", "A2", "A2", "A2", 
                                   "B1", "B1", "B1", "B2", "B2", "B2"],
                   "Value":       [1, 2, 3, 2, 3, 5, 
                                   1, 6, 7, 3, 6, 7]})

Solution

import scipy as sp
# this will contain a list of DataFrames storing the correlation matrices
correlations = []
for g in df.groupby("Category"):
    sub_df = g[1][["SubCategory", "Value"]]
    data = sub_df.pivot_table(columns="SubCategory", values="Value", aggfunc=list)
    correlation = pd.DataFrame(sp.corrcoef(data.values.tolist()[0]), 
                               columns=data.columns.values.tolist(), 
                               index=data.columns.values.tolist())
    correlations.append(correlation)
pd.concat(correlations, sort=False)

Output

    A1          A2          B1          B2
________________________________________________
A1  1.000000    0.996271    NaN         NaN
A2  0.996271    1.000000    NaN         NaN
B1  NaN         NaN         1.000000    0.996271
B2  NaN         NaN         0.996271    1.000000

Update

This solution was tested on python and pandas versions as shown below, older versions may not work:

from platform import python_version
print('python version:', python_version())
import pandas as pd
print('pandas version:', pd.__version__)

    python version: 3.7.0
    pandas version: 0.23.4

i copy pasted both of your code and it gives me this error: ValueError: Shape of passed values is (2, 2), indices imply (2, 1) — rookiescientist, Dec 13 '18 at 16:32
on this line: data = sub_df.pivot_table(columns="SubCategory", values="Value", aggfunc=list) — rookiescientist, Dec 13 '18 at 16:48
hmm, I just got the same error on my phone, but I swear I tested on my desktop yesterday and it worked. the only difference might be my desktop has python 3.7 and my phone has 3.6, possibly different versions of pandad too. I'll check when I get home in the afternoon — Indominus, Dec 13 '18 at 16:57
I just restarted my desktop and tried the code again from fresh, and it still works. I suspect it is the versions of packages. May I ask what versions of python and pandas? Mine is 3.7.0 and 0.23.4. I just did a quick check on github, seems `pivot_table` was added in pandas 0.20. — Indominus, Dec 14 '18 at 00:17

how to compute correlation coefficient for multi-variable 1 column

2 Answers2