
Please be informed that I already looked through various posts before turning to you. In fact, I tried to implement the solution provided in: dropping rows from dataframe based on a "not in" condition

My problem is the following. Let's assume I have a huge dataframe from which I want to remove duplicates. I'm well aware I could use drop_duplicates, since it is the fastest and simplest approach. However, our teacher wants us to create a list containing the IDs of the duplicates and then remove the rows whose values are contained in that list.

import pandas as pd

# My list
list1 = ['s1', 's2']
print(len(list1))

# My dataframe
data1 = pd.DataFrame(data={'id': ['s1', 's2', 's3', 's4', 's5', 's6']})
print(len(data1))

# Remove all the rows whose 'id' value is contained in list1
data2 = data1[~data1.id.isin(list1)]
print(len(data2))

Now, let's see the output (these are the numbers from my real dataframe, not the toy example above):

Len list1 = 135
Len data1 = 8942
Len data2 = 8672

So, I came to the conclusion that my code is somehow removing twice as many rows as there are IDs in my list.

However, when I follow the drop_duplicates approach, my code works just fine and removes the 135 rows.

Could any of you help me understand why that is happening? I tried to simplify the issue as far as possible.

Thanks a lot!

MadMarx17
  • `8942 - 8672 = 270 = 135 * 2` hmm.... –  Dec 07 '21 at 20:20
    Maybe there are two rows for every id in `list1`? –  Dec 07 '21 at 20:20
  • Note that you can also remove the rows like this: `data1.set_index('id').drop(list1).reset_index()` –  Dec 07 '21 at 20:21
  • Thanks a lot, I checked the first 50 rows and, in fact, there are 2-3 rows with the same id. That being said, is there any way to cap the removal up to 1 sample? – MadMarx17 Dec 07 '21 at 21:07
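For what it's worth, here is a minimal sketch (with invented duplicate ids, since the real data isn't shown) of both points raised in the comments: isin drops every row whose id is in the list, and keeping only one row per listed id can be done with duplicated.

import pandas as pd

# Invented toy data: each id in list1 appears twice (an assumption that
# mirrors what the comments suggest about the real dataframe).
list1 = ['s1', 's2']
data1 = pd.DataFrame({'id': ['s1', 's1', 's2', 's2', 's3', 's4']})

# isin removes every matching row, so 2 listed ids cost 4 rows:
print(len(data1) - len(data1[~data1.id.isin(list1)]))  # 4

# To keep one row per listed id instead, drop only the later occurrences:
mask = data1.id.isin(list1) & data1.id.duplicated(keep='first')
print(len(data1) - len(data1[~mask]))  # 2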

1 Answer


This is an extraordinarily painful way to do what you're asking. Maybe someone will see this and come up with a less painful way. I specifically stayed away from groupby('id').first() as a means to remove duplicates because you mentioned needing to create a list of duplicates first. But that would be my next best recommendation.
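For reference, that groupby route would look something like this. This is only a sketch of the alternative mentioned above, not the method used below, and the val column is invented for illustration:

import pandas as pd

# first() keeps the first row per id, removing duplicates in one step but
# skipping the intermediate list of duplicate ids the assignment requires.
df = pd.DataFrame({'id':  ['s1', 's2', 's1', 's2', 's3'],
                   'val': [1, 2, 3, 4, 5]})
print(df.groupby('id', as_index=False).first())
#    id  val
# 0  s1    1
# 1  s2    2
# 2  s3    5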

Anyway, I added duplicates of s1 and s2 to your example:

df = pd.DataFrame(data={'id':['s1' , 's2', 's3', 's4', 's5' , 's6', 's1' , 's2', 's2']})

Finding IDs with more than 1 entry (assuming duplicates). Here I do use groupby to get counts, keep those > 1, and send the unique values to a list:

dup_list = df[df.groupby('id')['id'].transform('count') > 1]['id'].unique().tolist()
print(dup_list)

['s1', 's2']
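If the transform/filter chain feels dense, an equivalent way to build the same list (a sketch, not part of the original answer) is via value_counts:

# Count occurrences of each id, then keep the ids seen more than once.
counts = df['id'].value_counts()
dup_list = counts[counts > 1].index.tolist()
print(dup_list)  # ['s2', 's1'] -- same ids, possibly in a different order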

Then iterate over the list, finding the indices that are duplicated and removing all but the first:

for id in dup_list:
    # all row indices for this id except the first occurrence
    drp = df[df['id'] == id].index[1:].to_list()
    df.drop(drp, inplace=True)
df

   id
0  s1
1  s2
2  s3
3  s4
4  s5
5  s6

Indices 6, 7, and 8 were dropped.
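If a loop-free version is ever acceptable, the same result can be had in one step by keeping only the first occurrence of each id in dup_list. This is a sketch under the same assumptions, starting again from the original nine-row frame:

import pandas as pd

# Original example frame (before the loop above modified df in place).
df = pd.DataFrame(data={'id': ['s1', 's2', 's3', 's4', 's5', 's6', 's1', 's2', 's2']})

# Keep a row unless its id is in dup_list and it is not the first occurrence.
mask = df['id'].isin(dup_list) & df.duplicated(subset='id', keep='first')
df = df[~mask]
print(df)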

Jonathan Leon