I have a pandas DataFrame like this:
    import numpy as np
    from pandas import DataFrame

    n = 6000
    my_data = DataFrame({
        "Category": np.random.choice(["cat1", "cat2"], size=n),
        "val_1": np.random.randn(n),
        "val_2": list(range(1, n + 1)),
    })
I am aggregating on Category and applying different functions to different columns, like so:
    counts_and_means = my_data.groupby("Category").agg({
        "Category": np.count_nonzero,
        "val_1": np.mean,
        "val_2": np.mean,
    })
After this finishes, I want an explicit column ordering and new column names. I do that with reindex and rename, chaining them onto the original aggregation in a fluent style, like so:
    counts_and_means = (
        my_data.groupby("Category")
        .agg({
            "Category": np.count_nonzero,
            "val_1": np.mean,
            "val_2": np.mean,
        })
        .reindex(columns=["Category", "val_1", "val_2"])
        .rename(columns={
            "Category": "Count",
            "val_1": "Avg. Val_1",
            "val_2": "Avg. Val_2",
        })
    )
Is this the best way (in terms of idiom, performance, etc.)? Or is there a way to specify the column names and ordering explicitly, right in the agg(...) step?
I am asking because I am new to the idioms of this API and want to get them right, and because it looks like reindex and rename both create DataFrame copies, which could be a bigger issue with large data sets. (I am aware of the inplace parameter for rename, but that wouldn't work in my fluent setup.) Any help or advice is greatly appreciated.
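
For reference, here is a minimal sketch of what naming and ordering the columns inside the agg(...) step could look like. It assumes pandas 0.25 or later, which added "named aggregation": each keyword is an output column name, and each value is an (input column, aggregation) pair. Two things here are my own assumptions rather than part of the setup above: the count is taken from val_1 instead of Category, and the random generator is seeded only to make the sketch reproducible.

```python
import numpy as np
import pandas as pd

n = 6000
rng = np.random.default_rng(0)  # seeded only for reproducibility (assumption)
my_data = pd.DataFrame({
    "Category": rng.choice(["cat1", "cat2"], size=n),
    "val_1": rng.standard_normal(n),
    "val_2": np.arange(1, n + 1),
})

# Named aggregation: keyword = output column name,
# value = (input column, aggregation).
# The names contain spaces and dots, so they are passed via ** unpacking.
# Output columns appear in the order the keywords are given.
counts_and_means = my_data.groupby("Category").agg(**{
    "Count": ("val_1", "count"),     # non-null values of val_1 per group
    "Avg. Val_1": ("val_1", "mean"),
    "Avg. Val_2": ("val_2", "mean"),
})

print(counts_and_means)
```

With this form the separate reindex and rename calls are no longer needed, since the keyword order fixes the column order and the keywords themselves are the final names. Note that "count" skips missing values, so it equals the group size only when val_1 has no NaNs.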