I am working on a DataFrame that may contain duplicates.
I went to this link but I couldn't find what I needed.
What I tried is to create a list of duplicates using df.duplicated(), which gives me a True or False value for each index.
Then, for each index in this list where the result is True, I get the matching rows from the df using df.loc[df['id'] == df['id'][dups]].
Depending on this result, I call a function giveID() which returns a list of indexes to delete from the duplicates list. Since I don't need to iterate over the duplicates that are supposed to be deleted, is it possible to delete these indexes from the duplicates list during the for loop without breaking everything?
Here is an example of my df (the duplicates are based on the id column):

   |  id | type
---+-----+------
 0 | 312 | data2
 1 | 334 | data
 2 |  22 | data1
 3 | 312 | data8
# Here rows 0 and 3 are duplicates based on id
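To make the example above concrete, here is a minimal sketch that builds that sample frame and shows what df.duplicated() returns on it (keep=False flags every row of a duplicated group, not just the later ones):

```python
import pandas as pd

# Sample frame from the question
df = pd.DataFrame({'id': [312, 334, 22, 312],
                   'type': ['data2', 'data', 'data1', 'data8']})

# Rows 0 and 3 share id 312, so keep=False marks both as True
mask = df.duplicated(subset='id', keep=False)
print(mask.tolist())  # [True, False, False, True]
```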
Here is a part of my code:
duplicates = df.duplicated(subset='id', keep=False)
duplicates = duplicates[duplicates]
df_dup = df
listidx = []

for dups in duplicates.index:
    dup_id = df.loc[df['id'] == df['id'][dups]]
    for a in giveID(dup_id):
        if a not in listidx:
            listidx.append(a)
    # Here I want to delete all of listidx from duplicates inside the for loop,
    # so that I don't iterate over unnecessary duplicates

def giveID(dup_id):
    # some code that returns a list of indexes
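One way to express the loop above without mutating the object being iterated is to keep the still-unprocessed indexes in a separate set and shrink that set as each duplicate group is handled. This is a minimal sketch, not the actual code: the giveID() below is a hypothetical stand-in (here it simply marks everything after the first occurrence for deletion), since the real implementation is not shown in the question.

```python
import pandas as pd

def giveID(dup_rows):
    # Hypothetical stand-in for the question's giveID():
    # keep the first occurrence, return the other indexes for deletion.
    return list(dup_rows.index[1:])

df = pd.DataFrame({'id': [312, 334, 22, 312, 312],
                   'type': ['data2', 'data', 'data1', 'data8', 'data9']})

dup_mask = df.duplicated(subset='id', keep=False)
remaining = set(df.index[dup_mask])   # mutable copy we can shrink safely
listidx = []

while remaining:
    dups = min(remaining)             # process groups in index order
    dup_rows = df.loc[df['id'] == df['id'][dups]]
    listidx.extend(a for a in giveID(dup_rows) if a not in listidx)
    # Dropping the whole group from `remaining` (not from the frame or
    # from an object mid-iteration) is safe, so already-handled
    # duplicates are never visited again.
    remaining -= set(dup_rows.index)

print(sorted(listidx))  # [3, 4]
```

The key point is that a while-loop over a shrinking set sidesteps the "modifying a collection while iterating over it" problem entirely.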
This is what duplicates looks like in my code:
0 True
1 True
582 True
583 True
605 True
606 True
622 True
623 True
624 True
625 True
626 True
627 True
628 True
629 True
630 True
631 True
...
1990368 True
1991030 True
And I would like to get the same result, but without the unnecessary duplicates.