
I want to calculate the mean of several columns for two groups in pandas and test whether the group means differ. I can work out the calculation part, but I have not found a good solution for the test part. Below are a toy sample and the result I want.

import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,100,size=(100, 2)), columns=['col_1','col_2'])
df['group'] = ['A']*50 + ['B']*50

df.groupby('group').agg({"col_1":"mean","col_2":"mean"})

       col_1  col_2
group              
A      52.26  56.58
B      53.04  49.18

What I want to have:

       col_1  t_col_1  col_2  t_col_2
group
A      52.26   4.3***  56.58      0.8
B      53.04   4.3***  49.18      0.8

Here t_col_1 is the t statistic for the difference between the means of col_1 in group A and group B, i.e. t.test(df.loc[df['group'].isin(['B'])]['col_1'], df.loc[df['group'].isin(['A'])]['col_1']). The stars are not necessary, but it would be great if they could be there.

Any suggestions on how to do this?

Jia Gao
  • https://stackoverflow.com/questions/13404468/t-test-in-pandas – BENY Nov 07 '19 at 22:12
  • should you do the t-test on the whole population, i.e. **before** groupby? – Quang Hoang Nov 07 '19 at 22:12
  • A `groupby.agg` isn't going to be great here because it partitions the DataFrame into separate groups and then does a calculation for each group. A two sample t-test requires you to send multiple groups into the function, though I guess the `groupby` will at least separate each group for you. – ALollz Nov 07 '19 at 22:13
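The point in the last comment can be sketched as follows: `groupby` is used only to split the frame, and both pieces are then passed to the test together. This is a minimal sketch, not a definitive implementation. The seeded sample data and the names `groups` and `t_by_col` are assumptions made for illustration, and `scipy.stats.ttest_ind` is used with its default equal-variance setting.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Toy data shaped like the question's sample (seeded so it is reproducible).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 100, size=(100, 2)), columns=["col_1", "col_2"])
df["group"] = ["A"] * 50 + ["B"] * 50

# Iterating a groupby yields (name, sub-frame) pairs; collect them in a dict
# so that both samples are available at the same time.
groups = {name: sub for name, sub in df.groupby("group")}

# One two-sample t-test per value column (B vs. A, as in the question).
t_by_col = {
    col: stats.ttest_ind(groups["B"][col], groups["A"][col]).statistic
    for col in ["col_1", "col_2"]
}
print(t_by_col)
```

The aggregation itself still runs per group; only the test needs both groups at once, which is why it is computed outside `agg`.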

1 Answer


You can iterate over the value columns and run a two-sample t-test between the two groups for each one:

import pandas as pd
import scipy.stats as stats

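tstats = {}
ix_a = df['group'] == 'A'  # boolean mask for group A; ~ix_a selects group B
for x in df.columns.drop('group'):
    # two-sample t-test of A vs. B; keep only the t statistic
    tstats['t_' + x] = stats.ttest_ind(df.loc[ix_a, x], df.loc[~ix_a, x]).statistic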

df.groupby('group').mean().assign(**tstats)

Result:

       col_1  col_2  t_col_1   t_col_2
group                                 
A      56.24  46.84  0.85443 -0.281279
B      51.24  48.42  0.85443 -0.281279
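If you also want the stars and the interleaved column order from the question, something along these lines might work. It is only a sketch: the helper names `stars` and `means_with_tests` are made up for illustration, the star thresholds (0.001, 0.01, 0.05) are one common convention and an assumption here, and every column other than the group column is assumed to be numeric. The test is run as B vs. A, following the question, so the sign is flipped relative to the loop above.

```python
import numpy as np
import pandas as pd
from scipy import stats

def stars(p):
    """Map a p-value to significance stars (thresholds are an assumed convention)."""
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return ""

def means_with_tests(df, group_col="group", a="A", b="B"):
    """Group means with a formatted t statistic next to each value column."""
    means = df.groupby(group_col).mean()
    in_a = df[group_col] == a
    in_b = df[group_col] == b
    out = pd.DataFrame(index=means.index)
    for col in means.columns:
        # Two-sample t-test of group b vs. group a for this column.
        res = stats.ttest_ind(df.loc[in_b, col], df.loc[in_a, col])
        out[col] = means[col]
        # The same test result is repeated on every row, as in the desired output.
        out["t_" + col] = f"{res.statistic:.2f}{stars(res.pvalue)}"
    return out

# Toy data shaped like the question's sample (seeded so it is reproducible).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 100, size=(100, 2)), columns=["col_1", "col_2"])
df["group"] = ["A"] * 50 + ["B"] * 50

result = means_with_tests(df)
print(result)
```

Note that the t_ columns hold strings once the stars are attached, so keep the numeric statistics separately if you need them for further calculation.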
busybear