
I have a pandas DataFrame like this:

import numpy as np
from pandas import DataFrame

n = 6000
my_data = DataFrame({
    "Category"  : np.random.choice(['cat1', 'cat2'], size=n),
    "val_1"     : np.random.randn(n),
    "val_2"     : [i for i in range(1, n + 1)]
})

I want to calculate the count of one column and the means of the other two, aggregating by Category. This is described in the pandas documentation as "Applying different functions to DataFrame columns", and I do it like this:

counts_and_means = \
    my_data.groupby("Category").agg({
        "Category"  : np.count_nonzero,
        "val_1"     : np.mean,
        "val_2"     : np.mean
    })

I also want to calculate a t-test p-value for val_2, testing the hypothesis that the mean of val_2 is zero. If val_2 were the only column I was doing anything with throughout this whole process, I could just do what is described in the pandas documentation as "Applying multiple functions at once." However, I'm trying to use both multiple columns AND multiple functions. I can explicitly name output columns when it's just the "multiple functions at once" case, but I can't figure out how to do it when there are also multiple columns involved. Right now, when I try to do this all in one agg(...) step, the val_2 p-value column definition overwrites the original mean column definition, because they're both keys in the same dict. So I end up needing to create a second DataFrame and join the two:

import pandas as pd
import scipy as sp
import scipy.stats  # makes sp.stats available

val_tests = \
    my_data.groupby("Category").agg({
        "val_2" : lambda arr: sp.stats.ttest_1samp(arr, popmean=0)[1]
    }) \
    .rename(columns={"val_2" : "p_val_2"})

results = pd.merge(counts_and_means, val_tests, left_index=True, right_index=True)

My question: is there some way to do this all in one agg(...) step, without having to create a second result DataFrame and performing the merge?

(See my other closely-related agg question here.)


1 Answer


You could try this. I'm passing a list of functions for val_2:

from scipy import stats

def ttest(arr):
    return stats.ttest_1samp(arr, popmean=0)[1]
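A quick sanity check of that helper on a made-up sample whose mean is clearly nonzero (the sample here is my own, just for illustration):

```python
import numpy as np
from scipy import stats

def ttest(arr):
    return stats.ttest_1samp(arr, popmean=0)[1]

# A sample whose mean is far from zero should give a very small p-value
p = ttest(np.arange(1, 101))
print(p < 0.001)  # True
```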

counts_and_means = \
    my_data.groupby("Category").agg({
        "Category"  : np.count_nonzero,
        "val_1"     : np.mean,
        "val_2"     : [np.mean, ttest]
    })

If you prefer a slightly shorter version:

counts_and_means = \
    my_data.groupby("Category").agg({
        "Category"  : np.count_nonzero,
        "val_1"     : np.mean,
        "val_2"     : [np.mean, lambda arr: stats.ttest_1samp(arr, popmean=0)[1]]
    })

This yields

              Category     val_1        val_2
         count_nonzero      mean         mean ttest
Category
cat1              3059  0.007861  2990.997712     0
cat2              2941 -0.007450  3010.383543     0

Per your comment below, here are a couple of ways to "flatten" your columns.

You can drop a level:

counts_and_means.columns = counts_and_means.columns.droplevel()
counts_and_means

but that removes the top level, leaving you with duplicate column names:

          count_nonzero      mean         mean ttest
Category
cat1               3059  0.007861  2990.997712     0
cat2               2941 -0.007450  3010.383543     0

This might be a better option. It uses a list comp to concatenate your column names, which I got from here:

counts_and_means.columns = ['%s%s' % (a, '_%s' % b if b else '') 
                           for a, b in counts_and_means.columns]
counts_and_means

          Category_count_nonzero  val_1_mean   val_2_mean  val_2_ttest
Category
cat1                        3059    0.007861  2990.997712            0
cat2                        2941   -0.007450  3010.383543            0
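For what it's worth, newer pandas versions (0.25+) support "named aggregation", which handles multiple columns and multiple functions in a single agg() call and produces flat, explicitly named output columns directly, with no MultiIndex to flatten. A sketch of that approach, assuming pandas >= 0.25 (the output column names here are my own):

```python
import numpy as np
import pandas as pd
from scipy import stats

n = 6000
my_data = pd.DataFrame({
    "Category": np.random.choice(["cat1", "cat2"], size=n),
    "val_1": np.random.randn(n),
    "val_2": range(1, n + 1),
})

# Each keyword becomes an output column; the tuple is (input column, function)
results = my_data.groupby("Category").agg(
    count=("Category", "size"),
    val_1_mean=("val_1", "mean"),
    val_2_mean=("val_2", "mean"),
    p_val_2=("val_2", lambda arr: stats.ttest_1samp(arr, popmean=0)[1]),
)
print(results.columns.tolist())
# ['count', 'val_1_mean', 'val_2_mean', 'p_val_2']
```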
  • This looks great, though I must confess I am new to `MultiIndex`, which is what the resulting `DataFrame` has. Do you know a way of "flattening" the `MultiIndex` into a regular 1-dimensional index of columns? – sparc_spread Apr 20 '15 at 03:05
  • Yeah, i think I know what you mean. I'll modify my answer – Bob Haffner Apr 20 '15 at 03:13