EdChum's answer may not always work as intended. Instead of first(), use nth(0).
The first() method is affected by a bug that has gone unsolved for some years now. Instead of returning the first row of each group, first() returns the first non-missing element of each column within each group, i.e. it ignores NaN values. For example, say you had a third column with some missing values:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'foo', 'bar', 'bar', 'bar'],
                   'B': ['1', '2', '2', '4', '1'],
                   'C': [np.nan, 'X', 'Y', 'Y', 'Y']})
     A  B    C
0  foo  1  NaN
1  foo  2    X
2  bar  2    Y
3  bar  4    Y
4  bar  1    Y
Using first() here (after sorting, as EdChum correctly does in their answer) skips over the missing values. Note how the foo row mixes values from two different rows:
df.sort_values('B').groupby('A').first()
     B  C
A
bar  1  Y
foo  1  X
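One way to confirm the mix-up is to check whether the foo row produced by first() exists anywhere in the original frame. This is only a minimal sketch that reuses the sample data from this answer; the names first_rows and exists_in_df are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'foo', 'bar', 'bar', 'bar'],
                   'B': ['1', '2', '2', '4', '1'],
                   'C': [np.nan, 'X', 'Y', 'Y', 'Y']})

# Values that first() reports for the 'foo' group
first_rows = df.sort_values('B').groupby('A').first()
foo_b = first_rows.loc['foo', 'B']
foo_c = first_rows.loc['foo', 'C']

# Does any original row combine A == 'foo' with this B and this C?
match = (df['A'] == 'foo') & (df['B'] == foo_b) & (df['C'] == foo_c)
exists_in_df = bool(match.any())
print(foo_b, foo_c, exists_in_df)
```

If exists_in_df comes out False, the row returned by first() was assembled from more than one original row.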
The correct way to get the full row, including missing values, is to use nth(0), which performs the expected operation:
df.sort_values('B').groupby('A').nth(0)
     B    C
A
bar  1    Y
foo  1  NaN
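Note that the layout of the nth(0) result depends on the pandas version. The output above, with A as the index, matches older releases; from pandas 2.0 onwards nth acts as a filter and returns the selected rows with their original index and with A kept as a column. The sketch below is one way to read the result that should work under either layout. Calling reset_index() is just a convenience for this illustration, and the variable names are made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'foo', 'bar', 'bar', 'bar'],
                   'B': ['1', '2', '2', '4', '1'],
                   'C': [np.nan, 'X', 'Y', 'Y', 'Y']})

# reset_index() guarantees that A is available as a regular column,
# whether nth(0) returned it as the index or left it in place
nth_rows = df.sort_values('B').groupby('A').nth(0).reset_index()

# Pick out the row selected for the 'foo' group
foo_row = nth_rows[nth_rows['A'] == 'foo'].iloc[0]
foo_b = foo_row['B']
foo_c_missing = bool(pd.isna(foo_row['C']))
print(foo_b, foo_c_missing)
```

Here the foo row keeps its missing C value alongside B, so the values all come from the same original row.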
For completeness, this bug also affects last(), its correct substitute being nth(-1).
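The same effect can be sketched for last(). The data below is hypothetical and chosen so that the last foo row has a missing value in C; as before, reset_index() is only there so the lookup works whether A comes back as the index or as a column.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the last 'foo' row has NaN in column C
df = pd.DataFrame({'A': ['foo', 'foo', 'bar'],
                   'B': [1, 2, 3],
                   'C': ['X', np.nan, 'Y']})

grouped = df.groupby('A')

# last() takes the last non-missing value of each column separately
via_last = grouped.last()
last_b = via_last.loc['foo', 'B']
last_c = via_last.loc['foo', 'C']

# nth(-1) selects the last row of each group as a whole
via_nth = grouped.nth(-1).reset_index()
nth_foo = via_nth[via_nth['A'] == 'foo'].iloc[0]
nth_b = nth_foo['B']
nth_c_missing = bool(pd.isna(nth_foo['C']))
print(last_b, last_c, nth_b, nth_c_missing)
```

With this data, last() pairs B from the second foo row with C from the first one, while nth(-1) keeps the second row intact.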
Posting this in an answer since it's too long for a comment. Not sure this is within the scope of the question but I think it's relevant to many people looking for this answer (like myself before writing this) and is extremely easy to miss.