pandas dataframe: subset by column + groupby another column

Question

I'm new to pandas dataframes and would appreciate help with the following problem (similar to this). I have the following data:

import pandas as pd
from scipy.stats import ttest_ind

data = {'Cat1': [2,1,2,1,2,1,2,1,1,1,2],
        'Cat2': [0,0,0,0,0,0,1,1,1,1,1],
        'values': [1,2,3,1,2,3,1,2,3,5,1]}
my_data = pd.DataFrame(data)

For every category in Cat2, I would like to run a ttest_ind that compares the values of the two categories in Cat1.

The way I see it, I could separate the data into

cat1_1 = my_data[my_data['Cat1']==1]
cat1_2 = my_data[my_data['Cat1']==2]

And then loop through every value in Cat2 to perform a t-test:

for cat2 in [0,1]:

    subset_1 = cat1_1[cat1_1['Cat2']==cat2]
    subset_2 = cat1_2[cat1_2['Cat2']==cat2]

    t, p = ttest_ind(subset_1['values'], subset_2['values'])

But this seems really convoluted. Could there be a simpler solution, maybe with groupby? Thanks a lot!

@galaxyan Could you please elaborate what you mean by that? Thanks! — Lisa, Feb 22 '16 at 18:38
http://pandas.pydata.org/pandas-docs/stable/merging.html it may help. — galaxyan, Feb 22 '16 at 18:39
But I already have a single dataframe. I think I'm looking for ways to split the data nicely, not merge, right? But I'd be happy to hear the solution you have in mind! — Lisa, Feb 22 '16 at 18:43

score 1 · Accepted Answer · answered Feb 22 '16 at 18:47

If I understand correctly, you can group by column Cat2 and apply a function f that splits each group by Cat1 and runs the t-test:

import pandas as pd
from scipy.stats import ttest_ind

data = {'Cat1': [2,1,2,1,2,1,2,1,1,1,2],
        'Cat2': [0,0,0,0,0,0,1,1,1,1,1],
        'values': [1,2,3,1,2,3,1,2,3,5,1]}
my_data = pd.DataFrame(data)
print(my_data)
    Cat1  Cat2  values
0      2     0       1
1      1     0       2
2      2     0       3
3      1     0       1
4      2     0       2
5      1     0       3
6      2     1       1
7      1     1       2
8      1     1       3
9      1     1       5
10     2     1       1

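def f(x):
    # x is the sub-DataFrame for a single Cat2 value; split it by Cat1
    cat1_1 = x[x['Cat1']==1]
    cat1_2 = x[x['Cat1']==2]

    # two-sample t-test on the values of the two Cat1 groups
    t, p = ttest_ind(cat1_1['values'], cat1_2['values'])
    # 'a' holds the t statistic, 'b' the p-value
    return pd.Series({'a':t, 'b':p})

print(my_data.groupby('Cat2').apply(f))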
            a         b
Cat2                   
0     0.00000  1.000000
1     2.04939  0.132842
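
If you prefer not to use apply, the same table can be built by looping over the groups directly. This is only a sketch of an alternative, not a better method: the names rows and results and the column labels 't' and 'p' are my own choices for illustration, and it assumes Cat1 only takes the values 1 and 2, as in the sample data.

```python
import pandas as pd
from scipy.stats import ttest_ind

data = {'Cat1': [2, 1, 2, 1, 2, 1, 2, 1, 1, 1, 2],
        'Cat2': [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
        'values': [1, 2, 3, 1, 2, 3, 1, 2, 3, 5, 1]}
my_data = pd.DataFrame(data)

rows = {}  # maps each Cat2 value to its test results
for cat2, group in my_data.groupby('Cat2'):
    # within one Cat2 group, pick the values belonging to each Cat1 category
    sample_1 = group.loc[group['Cat1'] == 1, 'values']
    sample_2 = group.loc[group['Cat1'] == 2, 'values']
    t, p = ttest_ind(sample_1, sample_2)
    rows[cat2] = {'t': t, 'p': p}

# one row per Cat2 value, one column per statistic
results = pd.DataFrame.from_dict(rows, orient='index')
results.index.name = 'Cat2'
print(results)
```

It should give the same t statistics and p-values as the apply version above, just with more descriptive column names.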
