
I have a pandas dataframe that looks like this:

import pandas as pd
import numpy as np
test_df = pd.DataFrame({'group': np.repeat(['A', 'B', 'C'], 50),
                        'value': np.random.randn(3 * 50)})

I would like to run a t-test between all pairs of the groups A, B and C.

Is there a Pythonic way to do that? I need something more generic than manually slicing the dataframe.

quant
  • Possible dupe of https://stackoverflow.com/a/13413842/4985099 (the last solution there looks similar to the above problem) – sushanth Jul 27 '20 at 08:34
  • 2
    Does this answer your question? [T-test in Pandas](https://stackoverflow.com/questions/13404468/t-test-in-pandas) – Let's try Jul 27 '20 at 08:36
  • 1
    If you want to test all groups with the same test, you might want to consider an ANOVA instead of multiple t-tests – drops Jul 27 '20 at 08:38
  • I need something more generic than manually slicing the df – quant Jul 27 '20 at 08:39
  • @drops if the mean of group `A` does not differ significantly from the mean of group `B`, but it differs significantly from the mean of group `C`, then ANOVA will reject the null, but you will not know "where this rejection comes from", right? – quant Jul 27 '20 at 08:57
  • 1
    @quant If your ANOVA leads to a rejection of the Null hypothesis, you can do a Tukey post-hoc test to see where the differences come from – drops Jul 27 '20 at 09:00
  • @drops I wasn't aware of this one. After a quick skim online, it seems to do the same thing as multiple t-tests. Are there any big differences between multiple t-tests and the Tukey post-hoc test, to your knowledge? – quant Jul 27 '20 at 09:09
  • 2
    @quant if you do multiple tests instead of only one, you increase the chances of getting false positives, i.e. getting a p-value below your threshold by chance. You can also circumvent the increased probability of false positives by dividing your threshold, for example 0.05, by the number of tests you do. Then your new threshold to reject the null hypothesis is 0.05/3 = 0.0167 (also called Bonferroni correction) – drops Jul 27 '20 at 09:15
  • @drops Maybe this is turning into a stats course, but why would this be the case? I mean, why is the chance of getting more false positives increased? – quant Jul 27 '20 at 09:26
  • 1
    @quant An exaplantion would go beyond the limits of a comment. I recommend starting here https://en.wikipedia.org/wiki/Multiple_comparisons_problem – drops Jul 27 '20 at 09:35
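The two approaches discussed in the comments can be sketched as follows (a minimal illustration, not from the original thread; it assumes SciPy >= 1.8 for `scipy.stats.tukey_hsd` and uses random data in place of `test_df`):

```python
import numpy as np
from scipy.stats import tukey_hsd

rng = np.random.default_rng(0)
a, b, c = (rng.standard_normal(50) for _ in range(3))

# Tukey's HSD compares all pairs of groups in one call while
# controlling the family-wise error rate.
res = tukey_hsd(a, b, c)
print(res.pvalue)  # symmetric 3x3 matrix of pairwise p-values

# Bonferroni alternative: divide alpha by the number of pairwise tests
# and compare each individual p-value against the stricter threshold.
alpha, n_tests = 0.05, 3
threshold = alpha / n_tests  # 0.05 / 3 ≈ 0.0167
```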

1 Answer


We can use `itertools.combinations` here to get all pairwise combinations of the unique values in `group`:

from itertools import combinations
from scipy.stats import ttest_ind

grps = test_df['group'].unique()   # ['A', 'B', 'C']
combs = combinations(grps, 2)      # ('A', 'B'), ('A', 'C'), ('B', 'C')

ttests = {
    f'{c1}_{c2}': ttest_ind(
        test_df.loc[test_df['group'] == c1, 'value'], 
        test_df.loc[test_df['group'] == c2, 'value']
    ) for c1, c2 in combs
}

Output

{'A_B': Ttest_indResult(statistic=1.2288295532881655, pvalue=0.22207832845954317),
 'A_C': Ttest_indResult(statistic=0.18451518261887467, pvalue=0.8539906100478168),
 'B_C': Ttest_indResult(statistic=-0.8658034013302348, pvalue=0.3887126452109223)}
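If the dictionary gets unwieldy, the same results can be collected into a tidy DataFrame for easier inspection (a sketch that rebuilds the setup so it runs on its own; column names are my own choice):

```python
import numpy as np
import pandas as pd
from itertools import combinations
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
test_df = pd.DataFrame({'group': np.repeat(['A', 'B', 'C'], 50),
                        'value': rng.standard_normal(150)})

# Same pairwise t-tests as above, keyed by the pair of group labels.
ttests = {
    f'{c1}_{c2}': ttest_ind(test_df.loc[test_df['group'] == c1, 'value'],
                            test_df.loc[test_df['group'] == c2, 'value'])
    for c1, c2 in combinations(test_df['group'].unique(), 2)
}

# One row per pair, with the statistic and p-value as columns.
results = pd.DataFrame(
    [(pair, res.statistic, res.pvalue) for pair, res in ttests.items()],
    columns=['pair', 'statistic', 'pvalue'],
)
print(results)
```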
Erfan