
I am working on a dataframe with some values inside. The problem is, I might have duplicates.

I looked at this link but I couldn't find what I needed.

What I tried is to build a list of duplicates with df.duplicated(), which gives me a True or False value for each index.
Then, for each index in this list where the result is True, I get the rows with the same id from the df using df.loc[(df['id'] == df['id'][dups])]. Depending on this result, I call a function giveID(), which returns a list of indexes to delete from the duplicates list. Because I don't need to iterate over the duplicates that are going to be deleted, is it possible to delete these indexes from the duplicates list during the for loop without breaking everything?

Here is an example of my df (duplicates are based on the id column):

   | id | type
--------------
0  | 312| data2
1  | 334| data
2  | 22 | data1
3  | 312| data8
#Here 0 and 3 are duplicates based on ID
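
For reference, a minimal sketch that rebuilds the example table above and shows which rows df.duplicated() flags. It assumes the duplicate check is on the id column with keep=False; the names mask and dup_index are only illustrative.

```python
import pandas as pd

# Example data copied from the table above
df = pd.DataFrame({"id": [312, 334, 22, 312],
                   "type": ["data2", "data", "data1", "data8"]})

# keep=False marks every member of a duplicated group, not only the later ones
mask = df.duplicated(subset="id", keep=False)

# Keep only the index labels where the mask is True
dup_index = mask[mask].index.tolist()
print(dup_index)  # [0, 3]
```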

Here is a part of my code:

import pandas as pd

def giveID(dup_rows):
    # some code that returns a list of indexes
    ...

# keep=False flags every row whose id appears more than once
duplicates = df.duplicated(subset='id', keep=False)
duplicates = duplicates[duplicates]  # keep only the rows flagged True

df_dup = df  # note: this is an alias of df, not a copy
listidx = []
for dups in duplicates.index:
    # all rows sharing the id of the current duplicate
    dup_id = df.loc[df['id'] == df['id'][dups]]
    for a in giveID(dup_id):
        if a not in listidx:
            listidx.append(a)

# here I want to delete everything in listidx from duplicates inside the for loop
# so that I don't iterate over unnecessary duplicates

This is what duplicates looks like in my code:

0          True
1          True
582        True
583        True
605        True
606        True
622        True
623        True
624        True
625        True
626        True
627        True
628        True
629        True
630        True
631        True
           ... 
1990368    True
1991030    True

And I would like to get the same thing, but without the unnecessary duplicates.
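
To make the desired result concrete, here is a rough sketch of what I mean, assuming the indexes to delete are already known. The data and the contents of listidx are made up for illustration; in my real code listidx would come from giveID().

```python
import pandas as pd

# Made-up data: ids 312 and 22 are both duplicated
df = pd.DataFrame({"id": [312, 334, 22, 312, 22],
                   "type": ["data2", "data", "data1", "data8", "data3"]})

duplicates = df.duplicated(subset="id", keep=False)
duplicates = duplicates[duplicates]  # index labels 0, 2, 3, 4

listidx = [3]  # hypothetical: indexes that giveID() decided to delete

# Remove those labels; errors="ignore" skips labels that are not present
remaining = duplicates.drop(index=listidx, errors="ignore")
print(remaining.index.tolist())  # [0, 2, 4]
```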

Hanggy

1 Answer


If you need indexes of non-duplicated IDs:

import pandas as pd

df = pd.DataFrame({'ID':[0,1,1,3], 'B':[0,1,2,3]})
   B  ID
0  0   0
1  1   1
2  2   1
3  3   3

# List of indexes
non_duplicated = df.drop_duplicates(subset='ID', keep=False).index

df.loc[df.index.isin(non_duplicated)]
   B  ID
0  0   0
3  3   3
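
If the goal is instead to visit each duplicated id only once, so that nothing has to be removed from the list while looping, one possible approach is to iterate over groups rather than over individual index labels. The sketch below is only an illustration under assumptions: give_id is a hypothetical stand-in for the question's giveID() (here it keeps the first row of each group and returns the rest), and the column is named id as in the question.

```python
import pandas as pd

def give_id(rows):
    # Hypothetical stand-in for giveID(): keep the first row, delete the rest
    return list(rows.index[1:])

df = pd.DataFrame({"id": [312, 334, 22, 312, 22, 312],
                   "type": ["data2", "data", "data1", "data8", "data3", "data4"]})

# Rows whose id occurs more than once
dup_rows = df[df.duplicated(subset="id", keep=False)]

to_delete = []
# groupby yields each duplicated id exactly once, so no index is revisited
for _, rows in dup_rows.groupby("id"):
    to_delete.extend(give_id(rows))

result = df.drop(index=to_delete)
print(sorted(to_delete))      # [3, 4, 5]
print(result.index.tolist())  # [0, 1, 2]
```

Collecting the labels first and dropping them after the loop avoids mutating the object being iterated over.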



AT_asks