62

I have a data frame and I would like to group it by a particular column (or, in other words, by values from a particular column). I can do it in the following way: grouped = df.groupby(['ColumnName']).

I imagine the result of this operation as a table in which some cells can contain sets of values instead of single values. To get a usual table (i.e. a table in which every cell contains only a single value) I need to indicate what function I want to use to transform the sets of values in the cells into single values.

For example I can replace sets of values by their sum, or by their minimal or maximal value. I can do it in the following way: grouped.sum() or grouped.min() and so on.

Now I want to use different functions for different columns. I figured out that I can do it in the following way: grouped.agg({'ColumnName1':sum, 'ColumnName2':min}).

However, for some reason I cannot use first. In more detail: grouped.first() works, but grouped.agg({'ColumnName1':first, 'ColumnName2':first}) does not. As a result I get a NameError: NameError: name 'first' is not defined. So, my question is: why does this happen, and how can I resolve it?

ADDED

Here I found the following example:

grouped['D'].agg({'result1' : np.sum, 'result2' : np.mean})

Maybe I also need to use np? But in my case Python does not recognize "np". Should I import it?

JJJ
Roman
  • You don't need `np`, it'll work with plain old `sum` (only less efficiently). numpy is imported with pandas (if you `import pandas as pd` it's `pd.np`) but most people will also import it separately for convenience. – Andy Hayden Feb 21 '13 at 12:59

5 Answers

67

I think the issue is that there are two different first methods which share a name but act differently, one is for groupby objects and another for a Series/DataFrame (to do with timeseries).

To replicate the behaviour of the groupby first method over a DataFrame using agg you could use iloc[0] (which gets the first row of each group (DataFrame/Series) by position):

grouped.agg(lambda x: x.iloc[0])

For example:

In [1]: df = pd.DataFrame([[1, 2], [3, 4]])

In [2]: g = df.groupby(0)

In [3]: g.first()
Out[3]: 
   1
0   
1  2
3  4

In [4]: g.agg(lambda x: x.iloc[0])
Out[4]: 
   1
0   
1  2
3  4

Analogously you can replicate last using iloc[-1].

Note: This also works column-wise:

g.agg({1: lambda x: x.iloc[0]})

In older versions of pandas you would use the irow method (e.g. x.irow(0)); see previous edits.


A couple of updated notes:

This is better done using the nth groupby method, which is much faster in pandas >= 0.13:

g.nth(0)  # first
g.nth(-1)  # last

You have to take care a little, as the default behaviour for first and last ignores NaN rows... and IIRC for DataFrame groupbys it was broken pre-0.13... there's a dropna option for nth.
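The NaN caveat can be illustrated with a small sketch. The frame and the column names `key` and `val` are made up for this illustration and are not from the question; the comparison is between the groupby `first` method and the `iloc[0]` lambda shown earlier.

```python
import numpy as np
import pandas as pd

# Hypothetical data: group "a" starts with a missing value.
df = pd.DataFrame({"key": ["a", "a", "b"], "val": [np.nan, 1.0, 2.0]})
g = df.groupby("key")

# GroupBy.first returns the first non-null value in each group.
first_non_null = g["val"].first()

# The lambda takes the positionally first value, NaN included.
first_row = g["val"].agg(lambda s: s.iloc[0])

print(first_non_null)
print(first_row)
```

So the two approaches agree only when the first row of each group has no missing values.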

You can use the strings rather than built-ins (though IIRC pandas spots it's the sum builtin and applies np.sum):

grouped['D'].agg({'result1' : "sum", 'result2' : "mean"})
Andy Hayden
  • Just in case it's useful to anyone, according to [the docs](http://pandas.pydata.org/pandas-docs/dev/indexing.html), `irow` is now deprecated (`x.iloc[0]` does the trick instead) – cd98 Oct 30 '13 at 13:55
  • @cd98 Thanks for pointing that out, I've updated this with the newer syntax :) – Andy Hayden Oct 30 '13 at 19:55
  • I'm confused with [the docs](http://pandas.pydata.org/pandas-docs/stable/groupby.html#aggregation); it states: `Aggregating functions are ones that reduce the dimension of the returned objects, for example: mean, sum, size, count, std, var, sem, describe, first, last, nth, min, max.` So what are they talking about? – Tjorriemorrie Dec 05 '14 at 10:57
  • In some sense there are three types of mapping here: aggregation, apply and filter (the above is kind of a filter, although it uses the agg verb). The complication is that you can use **either** agg or apply to get the `.iloc[0]` job done; not sure why I used agg, apply is probably a better description. Since this post I fixed nth to work better so IMO that's the preferred solution here. – Andy Hayden Dec 05 '14 at 17:24
38

Instead of using first or last directly, use their string names in the agg method. For example, in the OP's case:

grouped = df.groupby(['ColumnName'])
grouped['D'].agg({'result1' : np.sum, 'result2' : np.mean})

# you can use the string names for first and last
grouped['D'].agg({'result1' : 'first', 'result2' : 'last'})
Y.G.
  • This is the much more current approach to solving this problem. – Tom Johnson Feb 23 '21 at 15:06
  • Is there a way to also pass a kwarg to the functions, e.g. `numeric_only=True`? – Jiageng Mar 22 '21 at 17:29
  • for future reference: passing dict to `SeriesGroupBy.aggregate` fails in pandas 1.3.5; reformat dict elements as method kwargs such that second example becomes `grouped['D'].agg(result1='first', result2='last')` – jbb Sep 07 '22 at 14:57
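Applied to the original question, the string-name approach might look like the sketch below. The data is invented, and the column names simply reuse the placeholders from the question; the keyword form is the variant suggested in the comment above and assumes a pandas version that supports named aggregation (0.25 or later).

```python
import pandas as pd

# Hypothetical data; column names mirror the placeholders in the question.
df = pd.DataFrame({
    "ColumnName": ["a", "a", "b", "b"],
    "ColumnName1": [1, 2, 3, 4],
    "ColumnName2": [10, 20, 30, 40],
})
grouped = df.groupby("ColumnName")

# Different aggregations per column, referred to by their string names.
per_column = grouped.agg({"ColumnName1": "first", "ColumnName2": "last"})

# Keyword form for a single column: each keyword becomes an output column.
renamed = grouped["ColumnName1"].agg(result1="first", result2="last")

print(per_column)
print(renamed)
```

Because the functions are passed as strings, no name called `first` has to exist in the Python namespace, which avoids the NameError.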
0

I'm not sure if this is really the issue, but sum and min are Python built-ins that take some iterables as input, whereas first is a method of pandas Series object, so maybe it's not in your namespace. Moreover it takes something else as an input (the doc says some offset value).
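The namespace point can be checked directly. This is only a small sketch using the standard `builtins` module; the variable names are chosen for the illustration.

```python
import builtins

# sum and min are built-in functions, so the bare names always resolve.
has_sum = hasattr(builtins, "sum")
has_min = hasattr(builtins, "min")

# There is no built-in called first, so a bare `first` raises NameError
# unless you define or import something with that name yourself.
has_first = hasattr(builtins, "first")

print(has_sum, has_min, has_first)
```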

I guess one way to get around it is to create your own first function, and define it such that it takes a Series object as an input, e.g.:

def first(Series, offset):
    return Series.first(offset)

or something like that..

herrfz
0

I would use a custom aggregator as shown below.

d = pd.DataFrame([[1,"man"], [1, "woman"], [1, "girl"], [2,"man"], [2, "woman"]],columns = 'number family'.split())
d

Here is the output:

    number family
 0       1    man
 1       1  woman
 2       1   girl
 3       2    man
 4       2  woman

Now the aggregation, taking the first and last elements of each group:

d.groupby(by = "number").agg(firstFamily= ('family', lambda x: list(x)[0]), lastFamily =('family', lambda x: list(x)[-1]))

The output of this aggregation is shown below.

       firstFamily lastFamily
number                       
1              man       girl
2              man      woman
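As a possible alternative to the lambdas, the same named aggregation can be sketched with the string names `"first"` and `"last"`. This reuses the example frame from this answer; note that, unlike `list(x)[0]`, the string versions skip missing values, so the results can differ when a group contains NaN.

```python
import pandas as pd

# Same example data as above.
d = pd.DataFrame(
    [[1, "man"], [1, "woman"], [1, "girl"], [2, "man"], [2, "woman"]],
    columns=["number", "family"],
)

# Named aggregation: output column name = (input column, aggregation name).
result = d.groupby("number").agg(
    firstFamily=("family", "first"),
    lastFamily=("family", "last"),
)

print(result)
```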

I hope this helps.

Samuel Nde
-2
c_df = b_df.groupby('time').agg(first_x=('x', lambda x: list(x)[0]),
                                last_x=('x', lambda x: list(x)[-1]),
                                last_y=('y', lambda x: list(x)[-1]))