I have a large DataFrame whose rows are grouped by a group_id column (1, 2, 3, etc.), with an item_id column identifying the item in each row. Each group_id has roughly 20-30 rows, and many rows within a group share the same item_id. I want to flag the first occurrence of each item_id within each group_id. If the same item_id appears again in another group_id, its first row there should be flagged as well, but only that first row.
I tried doing it with a for loop similar to this:
group_ids = df['group_id'].unique()
for g in group_ids:
    items = []  # items already seen in this group
    for i in df[df['group_id'] == g].index:
        item = df.loc[i, 'item']
        if item not in items:
            df.loc[i, 'first_occurence'] = 1
            items.append(item)
        else:
            df.loc[i, 'first_occurence'] = 0
While I think this would work, it is far too slow to be practical because it visits every row individually in Python. I am sure there is a better solution out there. I have been trying to think of a vectorized approach using np.where or df.apply, but neither seems very straightforward.
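For reference, here is a minimal sketch of one vectorized possibility, assuming the columns are named group_id and item as in the loop above. The sample data is made up for illustration. It relies on DataFrame.duplicated, which flags every row whose (group_id, item) pair has already appeared earlier in the frame, so negating it should leave only the first occurrences.

```python
import pandas as pd

# Hypothetical sample data; the real frame would have many more rows.
df = pd.DataFrame({
    "group_id": [1, 1, 1, 2, 2, 2],
    "item": ["a", "b", "a", "a", "a", "c"],
})

# duplicated() is True for every repeat of a (group_id, item) pair after
# its first appearance; negating it marks the first occurrences with 1.
df["first_occurence"] = (~df.duplicated(subset=["group_id", "item"])).astype(int)

print(df)
```

Because the pair of columns is checked together, an item that shows up in a second group is flagged again on its first row there, which matches the behaviour described above. This does not require the rows of a group to be contiguous.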