We are working to get off of SAS and onto Python/pandas. However, one thing we are having trouble with is creating a replacement for `PROC SUMMARY` (a.k.a. `PROC MEANS`) that has the SAS routine's flexibility. For non-SAS users: `PROC SUMMARY` is just a routine that produces a table containing "descriptive statistics for variables across all observations or within groups of observations" in a dataset, to paraphrase the SAS documentation. Our requirements are just a small subset of the full functionality - outputting a table where we have:
- Ability to apply different stats to different columns (for now just count, sum, mean, weighted mean)
- Ability to handle zero to many grouping variables
- Ability to specify a weight variable for weighted mean
We are not trying to do anything else (anything graphical, etc.)
Here is what we have so far:
```python
import numpy as np
import pandas as pd
from pandas import DataFrame, Series

def wmean_ungrouped(d, w):
    # Series.dot already reduces to a scalar, so no further .sum() is needed
    return d.dot(w) / w.sum()

def wmean_grouped(group, var_name_in, var_name_weight):
    d = group[var_name_in]
    w = group[var_name_weight]
    return (d * w).sum() / w.sum()

FUNCS = {
    "mean":  np.mean,
    "sum":   np.sum,
    "count": np.count_nonzero,
}
```
```python
def my_summary(
    data,
    var_names_in,
    var_names_out,
    var_functions,
    var_name_weight=None,
    var_names_group=None,
):
    result = DataFrame()
    if var_names_group is not None:
        grouped = data.groupby(var_names_group)
        for var_name_in, var_name_out, var_function in \
                zip(var_names_in, var_names_out, var_functions):
            if var_function == "wmean":
                func = lambda x: wmean_grouped(x, var_name_in, var_name_weight)
                result[var_name_out] = Series(grouped.apply(func))
            else:
                func = FUNCS[var_function]
                result[var_name_out] = grouped[var_name_in].apply(func)
    else:
        for var_name_in, var_name_out, var_function in \
                zip(var_names_in, var_names_out, var_functions):
            if var_function == "wmean":
                result[var_name_out] = \
                    Series(wmean_ungrouped(data[var_name_in], data[var_name_weight]))
            else:
                func = FUNCS[var_function]
                result[var_name_out] = Series(func(data[var_name_in]))
    return result
```
Here is a sample call to the `my_summary()` function:
```python
my_summary(
    data=df,
    var_names_in=["x_1", "x_1", "x_1", "x_1"],
    var_names_out=["x_1_c", "x_1_s", "x_1_m", "x_1_wm"],
    var_functions=["count", "sum", "mean", "wmean"],
    var_name_weight="val_1",
    var_names_group=["Region", "Category"],
)
```
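For concreteness, here is a hypothetical `df` (the values are invented; only the column names come from the call above), together with a plain `groupby` computation of the same four statistics for cross-checking:

```python
import pandas as pd

# Hypothetical input data - only the column names match the call above
df = pd.DataFrame({
    "Region":   ["East", "East", "West", "West"],
    "Category": ["A", "A", "A", "B"],
    "x_1":      [1.0, 2.0, 3.0, 4.0],
    "val_1":    [10.0, 30.0, 20.0, 40.0],
})

g = df.groupby(["Region", "Category"])
check = pd.DataFrame({
    "x_1_c":  g["x_1"].count(),
    "x_1_s":  g["x_1"].sum(),
    "x_1_m":  g["x_1"].mean(),
    # weighted mean of x_1 with weights val_1, per group
    "x_1_wm": g.apply(lambda x: (x["x_1"] * x["val_1"]).sum() / x["val_1"].sum()),
})
```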
`my_summary()` works, but as you can see, its implementation is not the prettiest. Here are the main issues:
- Two different code paths depending on grouped or ungrouped - this stems completely from the fact that `DataFrame` and `DataFrameGroupBy` have different ways of applying a programmatically selected reducing function to a single column. For `DataFrame`, the only way I've found is directly invoking `func(data[var_name_in])`. `data[var_name_in].apply(func)` doesn't work because `apply()` on a `Series` does not reduce (unlike `apply()` on a `DataFrame`). On the other hand, for `DataFrameGroupBy`, I have to use that very approach: `grouped[var_name_in].apply(func)`, because something like `func(grouped[var_name_in])` will not work (no reason it should).
- Special treatment for the weighted mean - this is because it operates on two columns, unlike all the other calculations, which operate on just one; I don't know if this can be helped.
- Two different weighted mean functions - this is a consequence of the first issue. The ungrouped function has `Series`-type parameters and needs `dot()` to multiply and reduce them; the grouped function eventually deals with `SeriesGroupBy` objects and has to use the `*` operator (acknowledgements to the answer to this SO post for the weighted average function code).
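One pandas-native way to soften the first issue (not used in the code above, just a sketch): `Series.agg` treats a function as a reducer, whereas `Series.apply` maps element-wise, so `agg` behaves consistently across the grouped and ungrouped cases:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0])

# Series.apply is element-wise: np.sum is called on each scalar, so nothing reduces
still_a_series = s.apply(np.sum)

# Series.agg reduces to a scalar; the same call also works on a SeriesGroupBy
total = s.agg("sum")
```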
So my questions are:
- Is there something native to pandas that can do all of this (i.e. throw out the above and use that instead)?
- If not, are there any fixes to any of the issues mentioned above?
- By any chance, is there some way to group by nothing - that is, to obtain a `DataFrameGroupBy` object from a `DataFrame` without grouping on any variable? Then the code paths would be reduced, as we would be dealing with the `DataFrameGroupBy` interface exclusively.
Update (old - scroll down for current)
@JohnE's answer provides a way to group by nothing: `groupby(lambda x: True)`. This is a workaround that he spotted in this SO post (which, incidentally, features an answer from Wes himself speaking of the need for a `DataFrame.agg()`, which would serve the same purpose). @JohnE's excellent solution allows us to deal exclusively with objects of type `DataFrameGroupBy`, and instantly eliminates most of the duplicated code paths. I was able to reduce the code further using some functional gimmickry that is now possible because we have only `DataFrameGroupBy` instances. Basically, all functions are generated as needed - the "generators" (in quotes here so as to not be confused with Python generator expressions) take two parameters, the value column name and the weight column name, the second of which is ignored in all cases except `wmean`. The generated functions are always applied over the entire `DataFrameGroupBy`, as was originally the case just with `wmean`, with the parameters supplying the correct column name(s) to use. I also replaced all the `np.*` implementations with pandas calculations, to better handle `NaN` values.
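To make the group-by-nothing trick concrete, here is a minimal sketch (toy data, names assumed): grouping by a function of the index that maps every row to the same key yields a `DataFrameGroupBy` with exactly one group.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})

# Every index label maps to the same key True, so there is a single
# group containing all rows - an "ungrouped" DataFrameGroupBy
g = df.groupby(lambda i: True)
totals = g["x"].sum()   # Series with the single index entry True
```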
Unless there is something native to pandas that can do this, this is our solution:
```python
import pandas as pd

FUNC_GENS = {
    "mean":  lambda y, z: lambda x: x[y].mean(),
    "sum":   lambda y, z: lambda x: x[y].sum(),
    "count": lambda y, z: lambda x: x[y].count(),
    "wmean": lambda y, z: lambda x: (x[y] * x[z]).sum() / x[z].sum(),
}

def my_summary(
    data,
    var_names_in,
    var_names_out,
    var_functions,
    var_name_weight=None,
    var_names_group=None,
):
    result = pd.DataFrame()
    if var_names_group is None:
        grouped = data.groupby(lambda x: True)
    else:
        grouped = data.groupby(var_names_group)
    for var_name_in, var_name_out, var_function in \
            zip(var_names_in, var_names_out, var_functions):
        func_gen = FUNC_GENS[var_function]
        func = func_gen(var_name_in, var_name_weight)
        result[var_name_out] = grouped.apply(func)
    return result
```
Update 2019 / Current Solution
Versions of pandas released subsequent to my original post now implement most of this functionality:
- Grouping by nothing - Wes M. had spoken in the past of the need for a `DataFrame.agg()`, and it was indeed added in version 0.20.0, along with a `Series.agg()`.
- Multiple aggregations of multiple columns, with specified names for output columns - this is now part of pandas as of version 0.25 in the form of `NamedAgg` inputs to the `agg()` function.

So, basically everything except the weighted average. A good current solution for that is here.
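For illustration, a minimal named-aggregation sketch (pandas >= 0.25; toy data and column names assumed, not from the original post):

```python
import pandas as pd

df = pd.DataFrame({
    "Region": ["East", "East", "West"],
    "x_1":    [1.0, 2.0, 3.0],
})

# Output column names are supplied directly as keyword arguments
out = df.groupby("Region").agg(
    x_1_c=pd.NamedAgg(column="x_1", aggfunc="count"),
    x_1_s=pd.NamedAgg(column="x_1", aggfunc="sum"),
    x_1_m=pd.NamedAgg(column="x_1", aggfunc="mean"),
)

# And "grouping by nothing" is now just Series.agg / DataFrame.agg:
ungrouped = df["x_1"].agg(["count", "sum", "mean"])
```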