
I have a large DataFrame whose rows are grouped: one column holds group IDs, say group_id (1, 2, 3, etc.), and another holds the item_id values within each of those groups. Each group_id may cover 20-30 rows, and many of those rows share the same item_id. I am trying to put a value on the first occurrence of each item_id within a given group_id. If the same item_id shows up again in a different group_id, it should get the value there as well, but again only the first time it appears in that group.
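
For example (the column names and values here are made up; my real frame is much larger), the data and the column I want to create might look like this:

```python
import pandas as pd

# Hypothetical sample with the same structure as my data
df = pd.DataFrame({
    'group_id': [1, 1, 1, 2, 2, 2],
    'item_id':  ['A', 'B', 'A', 'A', 'C', 'C'],
})

# Desired new column: 1 the first time an item_id appears within a group_id, else 0
#    group_id item_id  first_occurrence
# 0         1       A                 1
# 1         1       B                 1
# 2         1       A                 0
# 3         2       A                 1
# 4         2       C                 1
# 5         2       C                 0
```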

I tried doing it with a for loop similar to this:

group_ids = df['group_id'].unique()
for g in group_ids:
    items = []  # item_ids already seen in this group
    for i in df[df['group_id'] == g].index:
        if df.loc[i, 'item_id'] not in items:
            df.loc[i, 'first_occurrence'] = 1
            items.append(df.loc[i, 'item_id'])
        else:
            df.loc[i, 'first_occurrence'] = 0

And while I think this could work, it is way too slow to be practical. I am sure someone out there has a better solution. I have been trying to think of a better way using np.where or df.apply, but neither seems very straightforward.

Emac
  • Welcome to Stack Overflow. Please take the time to read this post on [how to provide a great pandas example](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) as well as how to provide a [Minimal, Complete, and Verifiable example](https://stackoverflow.com/help/mcve) and revise your question accordingly. These tips on how to ask a good question may also be useful. – yatu Mar 25 '20 at 18:24
  • `df.groupby(['group_ids','item_ids'])['time'].first()` or `df.duplicated(['group_ids', 'item_ids'])`? – Quang Hoang Mar 25 '20 at 18:30
  • Thanks, I think df.duplicated is what I want! – Emac Mar 25 '20 at 18:39
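
A minimal vectorized sketch of the `df.duplicated` approach suggested in the comments (the column names `group_id` and `item_id` and the output column `first_occurrence` are assumed from the question's description):

```python
# A (group_id, item_id) pair counts as duplicated on every appearance after its first,
# so negating duplicated() flags the first occurrence of each item within each group.
df['first_occurrence'] = (~df.duplicated(subset=['group_id', 'item_id'])).astype(int)
```

This replaces the nested loops with a single vectorized pass, which should be far faster on a large frame.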
