
I'm trying to write a small piece of code that drops duplicate rows based on a column's unique values. What I'm trying to accomplish: get all the unique values from user_id, and for each one drop duplicates with drop_duplicates whilst keeping the last occurrence, keeping in mind that the column I want to drop duplicates on is date_time.

code:

for i in recommender_train_df['user_id'].unique():
    recommender_train_df.loc[recommender_train_df['user_id'] == i].drop_duplicates(subset='date_time', keep="last", inplace=True)

The problem with this code is that it literally does nothing. I tried and tried, and the result is the same: nothing happens.

Quick note: I have 100k different user_id values (unique), so I need a solution that works as fast as possible for this problem.

  • Your requirement is weird; what's the difference from just dropping duplicates whilst keeping the last occurrence? – Ynjxsjmh Apr 10 '22 at 06:28
  • @Ynjxsjmh There might be multiple users who have the same date, and dropping those rows would be a problem, so I want to drop based on user_id, not based on the whole data frame. Say user 1 and user 2 both had a date of 2018-04-03: dropping duplicates normally would drop either user 1 or user 2 depending on who occurred last, whereas going through each user_id separately would only drop duplicate dates within that user. Hope this clarifies my idea (see the sketch after this comment). – Al-Meqdad Jabi Apr 10 '22 at 06:37
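
A quick sketch of what that comment describes, with made-up toy data: de-duplicating on date_time alone compares dates across all users, so one user's only row can be dropped just because another user happened to share the same date.

    import pandas as pd

    df = pd.DataFrame({
        'user_id':   [1, 2, 2],
        'date_time': ['2018-04-03', '2018-04-03', '2018-04-03'],
    })

    # Dedup on date_time alone: only the last row survives, silently
    # removing user 1 even though it was that user's only record.
    print(df.drop_duplicates(subset='date_time', keep='last'))
    #    user_id   date_time
    # 2        2  2018-04-03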

1 Answer


The problem is that recommender_train_df.loc[...] with a boolean mask returns a copy of the original DataFrame, so your drop_duplicates(..., inplace=True) call only modifies that temporary copy and never touches the original. See python - What rules does Pandas use to generate a view vs a copy? - Stack Overflow for more detail.
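
A minimal sketch of that behavior, using made-up toy data:

    import pandas as pd

    df = pd.DataFrame({'user_id': [1, 1, 2],
                       'date_time': ['2018-04-03', '2018-04-03', '2018-04-04']})

    sub = df.loc[df['user_id'] == 1]  # boolean indexing returns a copy, not a view
    sub.drop_duplicates(subset='date_time', keep='last', inplace=True)

    print(len(sub))  # 1 -- the copy was de-duplicated
    print(len(df))   # 3 -- the original frame is unchanged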

If you want to drop duplicates within only part of the DataFrame at a time, you can collect the indices of the duplicated rows and drop them from the original frame by those indices:

for i in recommender_train_df['user_id'].unique():
    # Flag every row of this user whose date_time appears again later for the same user
    mask = recommender_train_df.loc[recommender_train_df['user_id'] == i].duplicated(subset='date_time', keep="last")
    # Collect the flagged row labels and drop them from the original frame
    indices = mask[mask].index
    recommender_train_df.drop(indices, inplace=True)
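
If looping over 100k user_id values turns out to be too slow (see the comments below), the same result should be obtainable in a single vectorized call by treating the (user_id, date_time) pair as the duplicate key; a minimal sketch of that alternative:

    # Keep the last occurrence of every (user_id, date_time) pair in one pass;
    # identical dates belonging to different users are no longer treated as duplicates.
    recommender_train_df = recommender_train_df.drop_duplicates(
        subset=['user_id', 'date_time'], keep='last'
    )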
Ynjxsjmh
  • I applied the solution, but my data is quite big (5 million rows); it's been 30 minutes and it's still running. Once it finishes I will let you know. Thanks for the help. – Al-Meqdad Jabi Apr 10 '22 at 08:55
  • Considering I have 100k different user_id values and each iteration costs about 0.4 seconds, this method would take at least 11 hours to complete. I need a faster method. – Al-Meqdad Jabi Apr 10 '22 at 09:57
  • @Al-MeqdadJabi Have no idea. – Ynjxsjmh Apr 10 '22 at 10:18