Pandas Correlation Groupby

Question

Assuming I have a dataframe similar to the below, how would I get the correlation between 2 specific columns and then group by the 'ID' column? I believe the Pandas 'corr' method finds the correlation between all columns. If possible I would also like to know how I could find the 'groupby' correlation using the .agg function (i.e. np.correlate).

What I have:

ID  Val1    Val2    OtherData   OtherData
A   5       4       x           x
A   4       5       x           x
A   6       6       x           x
B   4       1       x           x
B   8       2       x           x
B   7       9       x           x
C   4       8       x           x
C   5       5       x           x
C   2       1       x           x

What I need:

ID  Correlation_Val1_Val2
A   0.12
B   0.22
C   0.05

JohnE · Accepted Answer · 2018-07-25T09:36:17.670

You pretty much figured out all the pieces, just need to combine them:

>>> df.groupby('ID')[['Val1','Val2']].corr()

             Val1      Val2
ID                         
A  Val1  1.000000  0.500000
   Val2  0.500000  1.000000
B  Val1  1.000000  0.385727
   Val2  0.385727  1.000000

In your case, printing out a 2x2 for each ID is excessively verbose. I don't see an option to print a scalar correlation instead of the whole matrix, but you can do something simple like this if you only have two variables:

>>> df.groupby('ID')[['Val1','Val2']].corr().iloc[0::2,-1]

ID       
A   Val1    0.500000
B   Val1    0.385727

For the more general case of 3+ variables

For 3 or more variables, it is not straightforward to create concise output but you could do something like this:

groups = list('Val1', 'Val2', 'Val3', 'Val4')
df2 = pd.DataFrame()
for i in range( len(groups)-1): 
    df2 = df2.append( df.groupby('ID')[groups].corr().stack()
                        .loc[:,groups[i],groups[i+1]:].reset_index() )

df2.columns = ['ID', 'v1', 'v2', 'corr']
df2.set_index(['ID','v1','v2']).sort_index()

Note that if we didn't have the groupby element, it would be straightforward to use an upper or lower triangle function from numpy. But since that element is present, it is not so easy to produce concise output in a more elegant manner as far as I can tell.

How would I change this to 'rolling_corr()' so that the rolling correlation is calculated every 10 days? — Gohawks, Mar 11 '15 at 22:52
This is a great answer. The fact that to something this simple involves having to mess around with `.iloc`s is one of the things which really frustrates me about pandas; if I want to set up a large processing pipeline for scientific data, I end up feeling like everything's held together with toothpaste. — Cai, Sep 25 '18 at 10:22

score 13 · Answer 2 · answered Nov 26 '19 at 10:12

13

One more simple solution:

df.groupby('ID')[['Val1','Val2']].corr().unstack().iloc[:,1]

answered Nov 26 '19 at 10:12

VovaM

342
4
8

score 6 · Answer 3 · answered Jul 25 '18 at 08:15

In the above answer; since ix has been depreciated use iloc instead with some minor other changes:

df.groupby('ID')[['Val1','Val2']].corr().iloc[0::2][['Val2']] # to get pandas DataFrame

or

df.groupby('ID')[['Val1','Val2']].corr().iloc[0::2]['Val2'] # to get pandas Series

NaWeeD · Answer 4 · 2022-07-14T19:27:58.623

If you also need to keep the structure of your dataframe here's what I came up with:

Assuming you start with this dataframe:

corr_df = x.groupby('ID')[['Val1','Val2']].corr().unstack()
corr_df.columns = ['Correlation_Val1_Val1', 'Correlation_Val1_Val2', 'Correlation_Val2_Val1', 'Correlation_Val2_Val2']
corr_df = corr_df.reset_index().drop(columns=['Correlation_Val1_Val1', 'Correlation_Val2_Val2', 'Correlation_Val2_Val1']) 'v2v1cor', 'v2v2cor'])

which results in:

score 0 · Answer 5 · answered Nov 18 '22 at 15:55

Combining other's answers to something that does not rely on implicit indices.

Getting correlation by group is relatively simple:

corr_df = op_data\
    .groupby('ID')[['Val1', 'Val2']]\
    .corr().unstack()[[('Val1', 'Val2')]]\
    .reset_index()

  ID      Val1
          Val2
0  A  0.500000
1  B  0.385727
2  C  0.714575

Now, to get the desired flat column structure we can use some tricks

corr_df.columns = corr_df.columns.map(lambda x: '_'.join(a for a in x if len(a)>0))

  ID  Val1_Val2
0  A   0.500000
1  B   0.385727
2  C   0.714575

p.s. to parse OPs data:

for rep in range(10):
    data_str = data_str.replace('  ', ' ')
op_data = pd.read_csv(io.StringIO(data_str), sep=' ', engine='python')

Asim Noor · Answer 6 · 2022-12-22T12:33:46.657

0

Interestingly this also works, with elegant output

    df.groupby('ID')['Val1'].corr(df['Val2'])

edited Dec 22 '22 at 12:33

answered Dec 22 '22 at 12:30

Asim Noor

1
1

Pandas Correlation Groupby

6 Answers6

For the more general case of 3+ variables

Linked