I have a pandas DataFrame like this:
import numpy as np
import pandas as pd
import scipy as sp
import scipy.stats  # makes sp.stats available

n = 6000
my_data = pd.DataFrame({
    "Category": np.random.choice(["cat1", "cat2"], size=n),
    "val_1": np.random.randn(n),
    "val_2": list(range(1, n + 1)),
})
I want to calculate the count of one column and the means of the other two, aggregating by Category. This is described in the pandas documentation as "Applying different functions to DataFrame columns", and I do it like this:
counts_and_means = my_data.groupby("Category").agg({
    "Category": np.count_nonzero,
    "val_1": np.mean,
    "val_2": np.mean,
})
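To make the per-column dict form concrete, here is a minimal sketch on a tiny made-up frame. The data, the names toy and toy_summary, and the string aggregators "count" and "mean" are illustrative assumptions, not part of my real code:

```python
import pandas as pd

# Tiny made-up frame, for illustration only.
toy = pd.DataFrame({
    "Category": ["cat1", "cat1", "cat2"],
    "val_1": [1.0, 3.0, 5.0],
    "val_2": [10, 20, 30],
})

# The dict maps each column name to the single function applied to it.
toy_summary = toy.groupby("Category").agg({"val_1": "count", "val_2": "mean"})
print(toy_summary)
```

Each key in the dict produces exactly one output column, which is why this form has no obvious slot for a second function on the same column.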
I also want to calculate a t-test p-value for val_2, testing the hypothesis that the mean of val_2 is zero. If val_2 were the only column I was doing anything with throughout this whole process, I could just do what is described in the pandas documentation as "Applying multiple functions at once." However, I'm trying to do both multiple columns AND multiple functions. I can explicitly name output columns when it's just the "multiple functions at once" case, but I can't figure out how to do it when there are also multiple columns involved. Right now, when I try to do this all in one agg(...) step, the val_2 p-value column definition overwrites the original mean column definition, because both are stored under the same key in the same dict. So I end up needing to create a second DataFrame and join the two:
val_tests = (
    my_data.groupby("Category")
    .agg({"val_2": lambda arr: sp.stats.ttest_1samp(arr, popmean=0)[1]})
    .rename(columns={"val_2": "p_val_2"})
)
results = pd.merge(counts_and_means, val_tests, left_index=True, right_index=True)
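The overwriting described above is ordinary Python dict behavior rather than anything pandas-specific: a dict literal that repeats a key keeps only the last value. A minimal sketch, with placeholder strings standing in for the real aggregation functions (the name spec is hypothetical):

```python
# Placeholder strings stand in for the real aggregation functions.
spec = {
    "val_2": "mean",
    "val_2": "p-value",  # same key again: silently replaces the entry above
}
print(spec)  # {'val_2': 'p-value'}
```

So by the time agg(...) sees the dict, the mean entry for val_2 is already gone.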
My question: is there some way to do this all in one agg(...) step, without having to create a second result DataFrame and perform the merge?
(See my other closely related agg question here.)