Looping over multiple variable combinations in pandas with groupby

Question

So I have some data like:

df = pandas.DataFrame({"X1":[1,2,5]*4, "X2":[1,2,10]*4, 
                       "Y":[2,4,6]*4, "group":["A","B"]*6})

And I want to create a table of linear regression slope coefficients, for each group, and for each relevant combination of variables, something along the lines of:

group  x   y  coef
A      X1  Y  0.97
A      X2  Y  0.85
B      X1  Y  0.73
B      X2  Y  0.81

I'm trying to do it something like this:

def OLS_slope_coef(df, xcol=0, ycol=1):
  x = df.ix[:,xcol]
  y = df.ix[:,ycol]
  slope, intercept, r, p, stderr = scipy.stats.linregress(x, y)
  return(slope)


s_df = pandas.DataFrame()
for x in ['X1', 'X2']:
    for y in ['Y']:
        s_df.ix[(x, y), 'coef'] = df.groupby('group').apply(OLS_slope_coef, x, y)

But it gives a ValueError: Incompatible indexer with Series.

Is there some way to do something like this? I don't care if the group, x, and y variables are indexes or dataframe columns (I'm going to .reset_index() anyway).

score 0 · Answer 1 · answered Mar 20 '14 at 11:57

The problem is that .apply returns a series with two elements (because there are two groups), indexed as 'A' and 'B', and so that is incompatible with .ix[(x,y), 'coef']. you can either do:

s_df = pd.DataFrame(index=['A', 'B'])
for x in ['X1', 'X2']:
    for y in ['Y']:
        s_df.loc[:, x + '-coef'] = df.groupby('group').apply(OLS_slope_coef, x, y)

which results in:

   X1-coef  X2-coef
A     0.92     0.37
B     0.92     0.37

[2 rows x 2 columns]

or, loop inside the function which is applied and return a data-frame:

import pandas as pd
def ols(df, xcols):
    from itertools import chain
    from scipy.stats import linregress
    fitcols = ['slope', 'intercept', 'rval', 'pval', 'stderr']
    cols = pd.MultiIndex.from_tuples([(var, k) for var in xcols for k in fitcols])
    fit = [linregress(df[xcol], df.Y) for xcol in xcols]
    return pd.DataFrame([list(chain(*fit))], columns=cols)

fit = df.groupby('group').apply(ols, xcols=['X1', 'X2'])
fit.reset_index(level=1, drop=True, inplace=True)

which would give:

          X1                                    X2                               
       slope  intercept  rval  pval  stderr  slope  intercept  rval  pval  stderr
group                                                                            
A       0.92       1.54  0.96     0    0.13   0.37        2.4  0.91  0.01    0.08
B       0.92       1.54  0.96     0    0.13   0.37        2.4  0.91  0.01    0.08

[2 rows x 10 columns]

The y column in the output table that I have is relevant - I want to be able to do all combinations of multiple independent variables (X1, X2,.., Xn) **and** multiple dependent variables (Y1, Y2,.., Yn). I suppose I could use your first suggestion with the y variable, then do `pandas.melt()` to get the column headers in a column, then do a column split on the hyphen or something. — naught101, Mar 21 '14 at 00:58
Last part of that solution is here: http://stackoverflow.com/questions/17116814/pandas-how-do-i-split-text-in-a-column-into-multiple-columns — naught101, Mar 21 '14 at 01:28

Looping over multiple variable combinations in pandas with groupby

1 Answers1