Grouping Dataframe by Multiple Columns, and Then Dropping Duplicates

Question

I have a dataframe which looks like this (see table). For simplicity sake I've "aapl" is the only ticker shown. However, the real dataframe has more tickers.

ticker	year	return
aapl	1999	1
aapl	2000	3
aapl	2000	2

What I'd like to do is first group the dataframe by ticker, then by year. Next, I'd like to remove any duplicate years. In the end the dataframe should look like this:

ticker	year	return
aapl	1999	1
aapl	2000	3

I have a working solution, but it's not very "Pandas-esque", and involves for loops. I'm semi-certain that if I come back to the solution in three months, it'll be completely foreign to me.

Right now, I've been working on the following, with little luck:

df = df.groupby('ticker').groupby('year').drop_duplicates(subset=['year'])

This however, produces the following error:

AttributeError: 'DataFrameGroupBy' object has no attribute 'groupby'

Any help here would be greatly appreciated, thanks.

Your criteria for dropping the duplicated year is based on the highest return? — , Aug 30 '22 at 00:01
Alternatively, you can check on this post to see which method you would like to use https://stackoverflow.com/a/73208075/16836078 — , Aug 30 '22 at 00:33

Vladimir Fokow · Accepted Answer · 2022-08-30T00:21:46.670

@QuangHoang provided the simplest version in the comments:

df.drop_duplicates(['ticker', 'year'])

Alternatively, you can use .groupby twice, inside two .applys:

df.groupby("ticker", group_keys=False).apply(lambda x: 
    x.groupby("year", group_keys=False).apply(lambda x: x.drop_duplicates(['year']))
)

Alternatively, you can use the .duplicated function:

df.groupby('ticker', group_keys=False).apply(lambda x: 
    x[~x['year'].duplicated(keep='first')])
)

score 0 · Answer 2 · answered Aug 30 '22 at 00:04

0

You can try to sort the values first and then groupby.tail

df.sort_values('return').groupby(['ticker','year']).tail(1)

  ticker  year  return
0   aapl  1999       1
1   aapl  2000       3

answered Aug 30 '22 at 00:04

Jg9912. · Answer 3 · 2022-08-30T00:23:00.953

0

I'm almost sure you want to do this:

df.drop_duplicates(subset=["ticker","year"])

output

edited Aug 30 '22 at 00:23

answered Aug 30 '22 at 00:21

Jg9912.

27
4

As it’s currently written, your answer is unclear. Please [edit] to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Sep 03 '22 at 08:25

Grouping Dataframe by Multiple Columns, and Then Dropping Duplicates

3 Answers3