110

What is the best way to do a groupby on a Pandas dataframe, but exclude some columns from that groupby? e.g. I have the following dataframe:

Code   Country      Item_Code   Item    Ele_Code    Unit    Y1961    Y1962   Y1963
2      Afghanistan  15          Wheat   5312        Ha      10       20      30
2      Afghanistan  25          Maize   5312        Ha      10       20      30
4      Angola       15          Wheat   7312        Ha      30       40      50
4      Angola       25          Maize   7312        Ha      30       40      50

I want to groupby the column Country and Item_Code and only compute the sum of the rows falling under the columns Y1961, Y1962 and Y1963. The resulting dataframe should look like this:

Code   Country      Item_Code   Item    Ele_Code    Unit    Y1961    Y1962   Y1963
2      Afghanistan  15          C3      5312        Ha      20       40       60
4      Angola       25          C4      7312        Ha      60       80      100

Right now I am doing this:

df.groupby('Country').sum()

However this adds up the values in the Item_Code column as well. Is there any way I can specify which columns to include in the sum() operation and which ones to exclude?

karel
  • 5,489
  • 46
  • 45
  • 50
user308827
  • 21,227
  • 87
  • 254
  • 417

4 Answers4

147

You can select the columns of a groupby:

In [11]: df.groupby(['Country', 'Item_Code'])[["Y1961", "Y1962", "Y1963"]].sum()
Out[11]:
                       Y1961  Y1962  Y1963
Country     Item_Code
Afghanistan 15            10     20     30
            25            10     20     30
Angola      15            30     40     50
            25            30     40     50

Note that the list passed must be a subset of the columns otherwise you'll see a KeyError.

Andy Hayden
  • 359,921
  • 101
  • 625
  • 535
48

The agg function will do this for you. Pass the columns and function as a dict with column, output:

df.groupby(['Country', 'Item_Code']).agg({'Y1961': np.sum, 'Y1962': [np.sum, np.mean]})  # Added example for two output columns from a single input column

This will display only the group by columns, and the specified aggregate columns. In this example I included two agg functions applied to 'Y1962'.

To get exactly what you hoped to see, included the other columns in the group by, and apply sums to the Y variables in the frame:

df.groupby(['Code', 'Country', 'Item_Code', 'Item', 'Ele_Code', 'Unit']).agg({'Y1961': np.sum, 'Y1962': np.sum, 'Y1963': np.sum})
leroyJr
  • 1,110
  • 9
  • 17
17

If you are looking for a more generalized way to apply to many columns, what you can do is to build a list of column names and pass it as the index of the grouped dataframe. In your case, for example:

columns = ['Y'+str(i) for year in range(1967, 2011)]

df.groupby('Country')[columns].agg('sum')
Richard
  • 56,349
  • 34
  • 180
  • 251
Superstar
  • 419
  • 1
  • 5
  • 7
2

If you want to add a suffix/prefix to the aggregated column names, use add_suffix() / add_prefix().

df.groupby(["Code", "Country"])[["Y1961", "Y1962", "Y1963"]].sum().add_suffix("_total")

suffix


If you want to retain Code and Country as columns after aggregation, set as_index=False in groupby() or use reset_index().

df.groupby(["Code", "Country"], as_index=False)[["Y1961", "Y1962", "Y1963"]].sum()
# df.groupby(["Code", "Country"])[["Y1961", "Y1962", "Y1963"]].sum().reset_index()

as_index

cottontail
  • 10,268
  • 18
  • 50
  • 51