Applying a user defined function to each subgroup of Group By in Pandas

Question

I've been working with pandas for a little while now, but I'm only just getting my feet wet with groupby.

I have the following function defined, which ultimately sorts and assigns values to new columns R, F, M, and RFM:

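def get_rfm(dataframe):
    # get_var is a helper that is not shown in the question.
    # DataFrame.sort was removed in later pandas releases; sort_values replaces it.
    dfr = dataframe.sort_values('last_order_date', ascending=True)
    get_var(dfr.R)

    dff = dfr.sort_values('number_of_orders', ascending=True)
    get_var(dff.F)

    dfm = dff.sort_values('total_price', ascending=True)
    get_var(dfm.M)

    # Combined score: the sum of the three component scores.
    dfm['RFM'] = dfm['R'] + dfm['M'] + dfm['F']
    dfrfm = dfm.sort_values('RFM', ascending=True)
    dfrfm.info()  # info() prints its summary itself
    return dfrfm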

I run this function on my pandas DataFrame and get what look like the expected results. I assign the returned value to a new DataFrame, which I then run some statistics on.

What I now want to do is group the DataFrame by one of the other columns and perform this analysis on each subgroup. I tried:

df.groupby('size_of_business').apply(get_rfm)

But the results are not what I expected. I get back a DataFrame that appears to have a MultiIndex:

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 57196 entries, ( Did Not Answer, 67103) to (More than 10 people, 5617)
Data columns (total 11 columns):

which is then followed by the list of columns. The first part of each MultiIndex entry appears to be the value I grouped the DataFrame by, followed by what looks like the original index.

I thought apply treated each group as a sub-DataFrame, which I can manipulate and then return. I believe my understanding of the structure is flawed, and I've had trouble finding anything to help correct it.
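For reference, here is a minimal sketch of the behaviour described above, on made-up data. The column names follow the question; the values and the helper `sort_by_price` are assumptions for illustration only. It suggests that apply does pass each group in as an ordinary DataFrame, and that the MultiIndex appears when the returned pieces are stitched back together.

```python
import pandas as pd

# Toy data: column names follow the question, the values are invented.
df = pd.DataFrame(
    {
        "size_of_business": ["small", "large", "small", "large"],
        "total_price": [30.0, 20.0, 10.0, 5.0],
    },
    index=[100, 101, 102, 103],
)

def sort_by_price(sub):
    # Hypothetical stand-in for get_rfm: receives one group as a DataFrame.
    return sub.sort_values("total_price")

# Selecting the value columns first keeps the grouping column out of the
# frame handed to the function.
grouped = df.groupby("size_of_business")[["total_price"]]
result = grouped.apply(sort_by_price)

# Outer index level: the group label. Inner level: the original row label.
print(result.index.nlevels)

# The rows of one group can be pulled back out by its label.
small = result.loc["small"]

# group_keys=False asks groupby not to add the group label as an index level.
flat = (
    df.groupby("size_of_business", group_keys=False)[["total_price"]]
    .apply(sort_by_price)
)
print(flat.index.nlevels)
```

So the outer level is not a sign that something went wrong; it is how the per-group results are labelled, and `result.loc[label]` is one way to get back to a single subgroup.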

What do you want the result to be? I'm guessing that "Did not answer" and "More than 10 people" are the values you're grouping on, and the other part of the index (numbers 67103 and 5617) are the index of the original DataFrame, now permuted. This is the normal way it works: the grouped-by elements are added as a new index level. What are you hoping to get? — BrenBarn, Dec 09 '13 at 21:09
After running this function, I was hoping to be able to access each subgroup again and perform further analysis on it. But I'm curious about the resulting format. After a plain groupby, I can call describe() and it returns a table sub-indexed by each group name, with the statistics. After my apply call, I want to look at the same type of table, but it is condensed into a single one, with the rows being the describe() statistics and no group-level indexing. — mrdst, Dec 09 '13 at 21:17
I think there's some alignment magic that happens at the end (rather than just a concat); I often find groupby apply a dark art. — Andy Hayden, Dec 09 '13 at 21:37
@mrdst: I still don't really understand what you're trying to do, but if you want to "perform further analysis" on each group, why don't you just do *that* analysis in the groupby function? That is, make a function that actually does the analysis you want done, and apply that with `groupby(...).apply(...)`, so it just returns the results of your analysis. — BrenBarn, Dec 15 '13 at 22:34
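The suggestion in the last comment can be sketched roughly as follows. The function `summarize` and the figures it computes are hypothetical, chosen only to show the shape of the result; the real analysis would go in its place. When the applied function returns a Series with the same labels for every group, the results should come back as one table with a row per group.

```python
import pandas as pd

# Toy data: column names follow the question, the values are invented.
df = pd.DataFrame(
    {
        "size_of_business": ["small", "large", "small", "large"],
        "total_price": [30.0, 20.0, 10.0, 5.0],
        "number_of_orders": [3, 2, 1, 1],
    }
)

def summarize(sub):
    # Hypothetical per-group analysis: one row of summary figures per group.
    return pd.Series(
        {
            "customers": len(sub),
            "mean_price": sub["total_price"].mean(),
            "max_orders": sub["number_of_orders"].max(),
        }
    )

value_columns = ["total_price", "number_of_orders"]
summary = df.groupby("size_of_business")[value_columns].apply(summarize)
print(summary)
```

Here `summary` is indexed by the group label, which is close to the describe()-style table asked about in the comments.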

score 1 · Answer 1 · answered Dec 09 '13 at 21:34

You can use as_index=False:

df.groupby('size_of_business', as_index=False)

Andy Hayden

This didn't really solve my problem; the output was the same. I ended up collecting the values to group by in a list, iterating over the list, getting each subframe with `dataframe=df[df['size_of_business']==groups]`, and then calling the function on the subframe. – mrdst Dec 10 '13 at 14:43
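The workaround in the comment above filters the frame once per group value. A groupby object can also be iterated directly, yielding (label, sub-DataFrame) pairs, which should give the same subframes without the manual list and boolean masks. A minimal sketch on made-up data, with `get_rfm_stub` as a hypothetical stand-in for the question's get_rfm:

```python
import pandas as pd

# Toy data: column names follow the question, the values are invented.
df = pd.DataFrame(
    {
        "size_of_business": ["small", "large", "small", "large"],
        "total_price": [30.0, 20.0, 10.0, 5.0],
    }
)

def get_rfm_stub(sub):
    # Hypothetical stand-in for get_rfm: just sorts the subframe.
    return sub.sort_values("total_price")

# Keep one processed DataFrame per group, keyed by the group label.
results = {}
for name, sub in df.groupby("size_of_business"):
    results[name] = get_rfm_stub(sub)

print(sorted(results))
```

Each entry in `results` is an independent DataFrame, so further per-group analysis can be run on it without dealing with a MultiIndex.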
