2

I would like to make multiple .groupby() operations on different subsets of a given dataset and bind them all together. For example:

import pandas as pd
df = pd.DataFrame({"ID":[1,1,2,2,2,3],"Subset":[1,1,2,2,2,3],"Value":[5,7,4,1,7,8]})
print(df)
   ID  Subset  Value
0   1       1      5
1   1       1      7
2   2       2      4
3   2       2      1
4   2       2      7
5   3       1      9

I would then like to concatenate the following objects and store the result in a pandas data frame:

gr1 = df[df["Subset"] == 1].groupby(["ID","Subset"]).mean()
gr2 = df[df["Subset"] == 2].groupby(["ID","Subset"]).mean()
# Why do gr1 and gr2 have column names in different rows?

I realize that df.groupby(["ID","Subset"]).mean() would give me the concatenated object I'm looking for. Just bear with me, this is a reduced example of what I'm actually dealing with.

I think the solution could be to transform gr1 and gr2 to pandas data frames and then concatenate them like I normally would.

In essence, my questions are the following:

  1. How do I convert a groupby result to a data frame object?
  2. In case this can be done without transforming the series to data frames, how do you bind two groupby results together and then transform that to a pandas data frame?

PS: I come from an R background, so to me it's odd to group a data frame by something and have the output return as a different type of object (series or multi index data frame). This is part of my question too: why does .groupby return a series? What kind of series is this? How come a series can have multiple columns and an index?

Arturo Sbr
  • 5,567
  • 4
  • 38
  • 76

1 Answers1

1

The return type in your example is a pandas MultiIndex object. To return a dataframe with a single transformation function for a single value, then you can use the following. Note the inclusion of as_index=False.


>>> gr1 = df[df["Subset"] == 1].groupby(["ID","Subset"], as_index=False).mean()
>>> gr1

    ID  Subset  Value
0    1       1      6


This however won't work if you wish to aggregate multiple functions like here. If you wish to avoid using df.groupby(["ID","Subset"]).mean(), then you can use the following for your example.


>>> gr1 = df[df["Subset"] == 1].groupby(["ID","Subset"], as_index=False).mean()
>>> gr2 = df[df["Subset"] == 2].groupby(["ID","Subset"], as_index=False).mean()

>>> pd.concat([gr1, gr2]).reset_index(drop=True)

   ID   Subset  Value
0   1        1      6
1   2        2      4


If you're only concerned with dealing with a specific subset of rows, the following could be applicable, since it removes the necessity to concatenate results.


>>> values = [1,2]
>>> df[df['Subset'].isin(values)].groupby(["ID","Subset"], as_index=False).mean()

    ID  Subset  Value
0   1        1      6
1   2        2      4
Josmoor98
  • 1,721
  • 10
  • 27