Updating list inside for loop, which is using this list

Question

I am working on a dataframe with some values inside. The problem is, I might have duplicates.

I looked at this link but I couldn't find what I needed.

What I tried is to create a list of duplicates using df.duplicated(), which gives me a True or False value for each index.
Then, for each index in this list where the result is True, I get the rows with the same id from the df using df.loc[(df['id']== df['id'][dups]) ]. Depending on this result, I call a function giveID() which returns a list of indexes to delete from the duplicates list. Because I don't need to iterate over the duplicates that are supposed to be deleted, is it possible to delete these indexes from the duplicates list during the for loop without breaking everything?

Here is an example of my df (the duplicates are based on the id column):

   | id | type
--------------
0  | 312| data2
1  | 334| data
2  | 22 | data1
3  | 312| data8
#Here 0 and 3 are duplicates based on ID

Here is a part of my code:

duplicates = df.duplicated(subset='column_name',keep=False)
duplicates = duplicates[duplicates]


df_dup = df
listidx = []
i=0
for dups in duplicates.index:

    dup_id = df.loc[(df['id']== df['id'][dups])]
    for a in giveID(dup_id):
        if a not in listidx:
            listidx.append(a)

# here I want to delete everything in listidx from duplicates inside the for loop
# so that I don't iterate over unnecessary duplicates

def giveID(id):
    # some code that returns a list of indexes

This is what duplicates looks like in my code:

0          True
1          True
582        True
583        True
605        True
606        True
622        True
623        True
624        True
625        True
626        True
627        True
628        True
629        True
630        True
631        True
           ... 
1990368    True
1991030    True

And I would like to get the same thing, but without the unnecessary duplicates.

AT_asks · Answer 1 · 2019-05-27T14:41:59.230


If you need indexes of non-duplicated IDs:

import pandas as pd

df = pd.DataFrame({'ID':[0,1,1,3], 'B':[0,1,2,3]})
   B  ID
0  0   0
1  1   1
2  2   1
3  3   3

# List of indexes
non_duplicated = df.drop_duplicates(subset='ID', keep=False).index

df.loc[df.index.isin(non_duplicated)]
   B  ID
0  0   0
3  3   3
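
If the goal is to skip the indexes that were already marked for deletion, one option is to leave the iterated list untouched and keep a separate set of "already handled" indexes. Removing items from a list while a for loop is walking over it tends to make the loop skip elements. The sketch below uses the sample data from the question; give_id() is only a hypothetical stand-in for giveID() (here it assumes the first row of each id is kept and the rest are returned), and to_delete, visited and remaining are names made up for the example.

```python
import pandas as pd

# Sample data from the question: rows 0 and 3 share id 312.
df = pd.DataFrame({
    "id": [312, 334, 22, 312],
    "type": ["data2", "data", "data1", "data8"],
})


def give_id(dup_rows):
    """Hypothetical stand-in for giveID(): keep the first row, return the others."""
    return list(dup_rows.index[1:])


# Index labels of every row whose id occurs more than once.
mask = df.duplicated(subset="id", keep=False)
duplicates = list(mask[mask].index)

to_delete = set()  # labels that give_id() marked for deletion
visited = []       # labels the loop actually processed

for idx in duplicates:
    if idx in to_delete:
        # Already flagged by an earlier iteration, so skip it.
        continue
    visited.append(idx)
    dup_rows = df.loc[df["id"] == df.loc[idx, "id"]]
    to_delete.update(give_id(dup_rows))

# Filter once the loop is done instead of mutating the list mid-iteration.
remaining = [idx for idx in duplicates if idx not in to_delete]
```

With the sample frame, the loop only processes index 0, marks index 3 for deletion, and remaining ends up as [0]. The set makes the "idx in to_delete" check cheap, which may matter with a couple of million rows.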

edited May 27 '19 at 14:41

answered May 27 '19 at 14:04

I don't need the non-duplicated ones because they are not a problem ;) – Hanggy May 27 '19 at 14:36
Do you need the duplicated ones? – AT_asks May 27 '19 at 14:41
Everything I need is in the question ;) But what I need to know is how to delete the duplicated IDs' indexes from the list I'm iterating over :) – Hanggy May 27 '19 at 14:58
You could select the non-duplicated items or use the set method .difference() to get the items that you need – AT_asks May 28 '19 at 05:51
As said in my question, I already have a function that gives me the duplicated indexes ;) – Hanggy May 28 '19 at 06:34
