35

Basically the same as "Select first row in each GROUP BY group?", only in pandas.

import pandas as pd

df = pd.DataFrame({'A': ['foo', 'foo', 'foo', 'foo', 'bar', 'bar', 'bar', 'bar'],
                   'B': ['3', '1', '2', '4', '2', '4', '1', '3'],
                   })

Sorting looks promising:

df.sort('B')

     A  B
1  foo  1
6  bar  1
2  foo  2
4  bar  2
0  foo  3
7  bar  3
3  foo  4
5  bar  4

But then first won't give the desired result... df.groupby('A').first()

     B
A     
bar  2
foo  3
asked by ihadanny

5 Answers

34

Generally, if you want your data sorted within a groupby but the sort column is not one of the columns being grouped on, it's better to sort the df prior to performing the groupby:

In [5]:
df.sort_values('B').groupby('A').first()

Out[5]:
     B
A     
bar  1
foo  1
answered by EdChum (edited by JohnE)

  • Hey, I'm preferring @JohnE's method as it's cleaner and more SQL-like. Maybe in pandas you can rely on sorting to be stable after grouping by another column (timsort?), but it's not as obvious as the `rank` syntax – ihadanny May 27 '15 at 21:30
  • @ihadanny. I totally agree, poor idea on pandas side. – user1700890 Apr 24 '19 at 16:45
I know this is old, but as of November 2022 [`first`'s behaviour](https://github.com/pandas-dev/pandas/issues/8427) does not do what the OP asked. For most cases, I think users might want to use `nth(0)`. Added a [full answer](https://stackoverflow.com/a/58224848/9786776) a while back explaining the difference. – Dídac Fernández Nov 29 '22 at 12:23
16

Here's an alternative approach using groupby().rank():

df[ df.groupby('A')['B'].rank() == 1 ]

     A  B
1  foo  1
6  bar  1

This gives you the same answer as @EdChum's for the OP's sample dataframe, but could give a different answer if you have any ties during the sort, for example, with data like this:

df = pd.DataFrame({'A': ['foo', 'foo', 'bar', 'bar'], 
                   'B': ['2', '1', '1', '1'] })

In this case you have some options using the optional method argument, depending on how you wish to handle sorting ties:

df[ df.groupby('A')['B'].rank(method='average') == 1 ]   # the default
df[ df.groupby('A')['B'].rank(method='min')     == 1 ]
df[ df.groupby('A')['B'].rank(method='first')   == 1 ]   # doesn't work, not sure why
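
One likely reason the last line fails: B holds strings in these examples, and rank(method='first') only supports numeric data. A minimal sketch of a workaround, assuming it is acceptable to cast B to a numeric type first:

import pandas as pd

df = pd.DataFrame({'A': ['foo', 'foo', 'bar', 'bar'],
                   'B': ['2', '1', '1', '1']})

# Assumption: the error comes from B being object (string) dtype,
# since method='first' only works on numeric data. Cast before ranking.
df['B'] = pd.to_numeric(df['B'])
df[ df.groupby('A')['B'].rank(method='first') == 1 ]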
answered by JohnE

  • The problem is that ``df.sort('B')`` returns a sorted *copy* of df; it does not change df itself. If you want to change df, you need to do one of the following: ``df = df.sort_values('B')`` or ``df.sort_values('B', inplace=True)``. (pandas now uses ``sort_values`` instead of ``sort``.) – prooffreader Nov 02 '16 at 20:13
14

The pandas groupby function could be used for what you want, but it's really meant for aggregation, and this is a simple 'take the first' operation.

What you actually want is the pandas drop_duplicates function, which by default keeps the first row of each set of duplicates. Pass what you would usually consider the groupby key as the subset= argument:

df.drop_duplicates(subset='A')

Should do what you want.

Also, df.sort('A') does not sort the DataFrame df; it returns a sorted copy. If you want to sort df itself, you have to add the inplace=True parameter.

df.sort('A', inplace=True)
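
Note that DataFrame.sort has since been removed from pandas; assuming the same intent, the sort_values equivalent would be:

df.sort_values('A', inplace=True)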
answered by firelynx
12

EdChum's answer may not always work as intended. Instead of first() use nth(0).

The method first() is affected by this bug that has gone unsolved for some years now. Instead of the expected behaviour, first() returns the first element that is not missing in each column within each group i.e. it ignores NaN values. For example, say you had a third column with some missing values:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A' : ['foo', 'foo', 'bar', 'bar', 'bar'],
                   'B' : ['1', '2', '2', '4', '1'],
                   'C' : [np.nan, 'X', 'Y', 'Y', 'Y']})

    A   B   C
0   foo 1   NaN
1   foo 2   X
2   bar 2   Y
3   bar 4   Y
4   bar 1   Y

Using first() here (after sorting, just like EdChum correctly assessed in their answer) will skip over the missing values (note how it is mixing up values from different rows):

df.sort_values('B').groupby('A').first()

    B   C
A       
bar 1   Y
foo 1   X

The correct way to get the full row, including missing values, is to use nth(0), which performs the expected operation:

df.sort_values('B').groupby('A').nth(0)

    B   C
A       
bar 1   Y
foo 1   NaN

For completeness, this bug also affects last(), its correct substitute being nth(-1).
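
As a minimal sketch using the same df as above, the analogous substitution for the last row of each group would be:

df.sort_values('B').groupby('A').last()    # skips missing values column by column
df.sort_values('B').groupby('A').nth(-1)   # returns the actual last row, NaN included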

Posting this in an answer since it's too long for a comment. Not sure this is within the scope of the question but I think it's relevant to many people looking for this answer (like myself before writing this) and is extremely easy to miss.

answered by Dídac Fernández

  • This answer is probably the most important within this thread as of early 2020. The longstanding bug with pandas aggregation is so silent that it appears to work, until you inspect your output real carefully, especially with large datasets. Glad that nth() provides a quick way out – kerwei Jan 22 '20 at 08:43
5

Sort first, then keep the first row of each group using .drop_duplicates() with its keep argument set to 'first' (the default). This approach has the benefit that it preserves your index.

Example:

df.sort_values('B', ascending=True) \
  .drop_duplicates('A', keep='first')

Output:

    A   B
1   foo 1
6   bar 1

You don’t need a GroupBy in this case because there is no need to run a calculation on each group (see: split-apply-combine pattern).

answered by fpersyn
  • Note - You can keep each first unique combination by feeding `.sort_values()` and `.drop_duplicates()` a list of column names (a short sketch follows after these comments). – fpersyn May 31 '19 at 13:01
  • The default value for the `keep` argument in `.drop_duplicates()` is "first". Setting it explicitly is optional but can improve readability in contexts like these. – fpersyn May 31 '19 at 14:14
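
A minimal sketch of that multi-column suggestion, assuming a made-up extra grouping column G (the column names here are purely illustrative):

import pandas as pd

df2 = pd.DataFrame({'A': ['foo', 'foo', 'bar', 'bar'],
                    'G': ['x', 'x', 'y', 'y'],
                    'B': [3, 1, 4, 2]})

# First row per (A, G) combination, ordered by B
df2.sort_values(['A', 'G', 'B']).drop_duplicates(subset=['A', 'G'], keep='first')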