Please be informed that I already looked through various posts before turning to you. In fact, I tried to implement the solution provided in: dropping rows from dataframe based on a "not in" condition
My problem is the following. Let's assume I have a huge dataframe from which I want to remove duplicates. I'm well aware that I could use drop_duplicates, since it is the fastest and simplest approach. However, our teacher wants us to create a list containing the IDs of the duplicates and then remove the rows whose values are contained in that list.
import pandas as pd

#My list
list1 = ['s1' , 's2']
print(len(list1))
#My dataframe
data1 = pd.DataFrame(data={'id':['s1' , 's2', 's3', 's4', 's5' , 's6']})
print(len(data1))
#Remove all rows whose 'id' value is contained in list1
data2 = data1[~data1.id.isin(list1)]
print(len(data2))
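
For completeness: in my real code list1 is not hard-coded; it holds the IDs that were flagged as duplicates. Roughly, I build it along these lines (a simplified sketch, not my exact code; df and dup_ids are hypothetical names, and I assume pandas' duplicated with its default keep='first'):

import pandas as pd

# Hypothetical data where 's1' and 's2' each appear twice
df = pd.DataFrame(data={'id': ['s1', 's2', 's3', 's1', 's2', 's6']})

# duplicated(keep='first') flags only the second and later occurrences
dup_ids = df.loc[df['id'].duplicated(keep='first'), 'id'].tolist()
print(dup_ids)  # ['s1', 's2']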
Now, let's see the output when I run the same code on my real dataset (the snippet above is only a simplified version of it):
Len list1 = 135
Len data1 = 8942
Len data2 = 8672
Since 8942 - 8672 = 270, which is exactly twice 135, I came to the conclusion that my code is somehow doubling the number of rows to be removed and removing them.
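
Here is a toy sketch of what I suspect is going on (just my assumption, with made-up data): because each duplicated ID also has an original occurrence in the dataframe, isin matches both rows.

import pandas as pd

# Toy data where 's1' appears twice: one original and one duplicate
df = pd.DataFrame(data={'id': ['s1', 's1', 's2', 's3']})
dup_ids = ['s1']  # a single duplicated ID

# isin matches BOTH occurrences of 's1', so two rows disappear
filtered = df[~df['id'].isin(dup_ids)]
print(len(df), len(filtered))  # 4 2 -> two rows removed for one ID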
However, when I follow the drop_duplicates approach, my code works just fine and removes exactly 135 rows.
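
For comparison, my drop_duplicates version is essentially this (assuming the default keep='first', so the first occurrence of each ID survives):

import pandas as pd

df = pd.DataFrame(data={'id': ['s1', 's1', 's2', 's3']})

# keep='first' (the default) drops only the later occurrences
deduped = df.drop_duplicates(subset='id', keep='first')
print(len(df), len(deduped))  # 4 3 -> only one row removed per duplicated ID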
Could any of you help me understand why that is happening? I have tried to simplify the issue as far as possible.
Thanks a lot!