I have a dataframe containing duplicates that are flagged by a specific variable. The df looks like this:
The idea is that each row to keep and its duplicates are stacked in batches (a pair, or more if there are several duplicates) and identified by the "duplicate" column. For each batch, I would like to keep a row based on one condition: it has to be the row with the smallest number of empty cells. For Alice, for instance, it should be the second row (and not the one flagged "keep"). The difficulty also lies in the fact that I cannot group by the "name", "lastname" or "phone" columns, because they are not always filled (the duplicates are computed on these 3 concatenated columns by an ML algorithm).
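To make "smallest number of empty cells" concrete, here is a minimal sketch of how the per-row count could be computed. The sample values are made up, and it assumes that missing values (None/NaN) and blank strings both count as empty, which may not match the real data.

```python
import pandas as pd

# Hypothetical rows: column names follow the question, values are invented.
df = pd.DataFrame({
    "name": ["Alice", "Alice"],
    "lastname": ["", "Smith"],
    "phone": [None, "555-0101"],
    "duplicate": ["keep", "delete"],  # "delete" is an assumed flag value
})

# Only the data columns are checked; the flag column is left out of the count.
data_cols = ["name", "lastname", "phone"]

# Assumption: a cell is empty if it is missing or contains only whitespace.
is_empty = df[data_cols].apply(
    lambda col: col.fillna("").astype(str).str.strip().eq("")
)

# Number of empty cells in each row.
n_empty = is_empty.sum(axis=1)
```

With these made-up rows, the second Alice row has the lower count, so it is the one I would want to keep.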
Unlike the already posted questions I've seen (how do I remove rows with duplicate values of columns in pandas data frame?), here the condition for selecting the row to keep is not fixed (like keeping the first or the last row within the batch of duplicates) but depends on how complete the rows are in each batch.
How can I parse the dataframe according to this "duplicate" column and, within each batch, extract the row I want?
I tried to assign a unique label to each batch in order to iterate over these labels, but it fails.
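For reference, the labelling idea can be sketched without an explicit loop. This is only a sketch under assumptions: the sample data is invented, every batch is assumed to start with a row flagged "keep", the other rows are assumed to carry some different value (here the hypothetical "delete"), and empty cells are assumed to be NaN/None.

```python
import pandas as pd

# Hypothetical sample data; the real values and flag names may differ.
df = pd.DataFrame({
    "name": ["Alice", "Alice", "Bob", None, "Bob"],
    "lastname": [None, "Smith", "Jones", "Jones", None],
    "phone": [None, "555-0101", "555-0202", None, None],
    "duplicate": ["keep", "delete", "keep", "delete", "delete"],
})

# Assumption: each batch begins with a "keep" row, so a running count
# of "keep" flags yields one integer label per batch.
batch = df["duplicate"].eq("keep").cumsum()

# Number of missing cells per row, ignoring the flag column.
n_empty = df.drop(columns="duplicate").isna().sum(axis=1)

# Index label of the row with the fewest missing cells in each batch
# (idxmin picks the first such row when there is a tie).
best_idx = n_empty.groupby(batch).idxmin()

# One row per batch: the most complete one.
result = df.loc[best_idx]
```

Grouping on the derived `batch` label avoids grouping on "name", "lastname" or "phone", which are not always filled.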