
I have a data frame `df`:

import pandas as pd

df = pd.DataFrame(
    [["A","X",98,56,61], ["B","E",79,54,36], ["A","Y",98,56,61], ["B","F",79,54,36],
     ["A","Z",98,56,61], ["A","W",48,51,85], ["B","G",44,57,86], ["B","H",79,54,36]],
    columns=["id","class","c1","c2","c3"])

When grouping by id, if rows are duplicated on several columns (here c1, c2, c3), I want to retain a single row chosen by the weightage given to the class column.

For example, for id A the values of c1, c2, c3 are duplicated for classes X, Y, Z. Among X, Y, Z the weightage goes to X, so retain X and delete the other rows. Similarly, among E, F, H the weightage goes to F, so retain F and delete the other rows.

Expected Output:

output = pd.DataFrame([["A","X",98,56,61],["B","F",79,54,36],["A","W",48,51,85],["B","G",44,57,86]], columns=["id","class","c1","c2","c3"])

How to do it?

Chethan
  • Use `df = df.drop_duplicates(['id','c1','c2','c3'])` – jezrael Mar 26 '21 at 14:29
  • It's not clear why you want to retain `F` instead of `E` for id `B`, for example. Is there any specific weightage? Can you please clarify? – anky Mar 26 '21 at 14:29
  • yes there is specific weightage, if X,Y,Z are duplicates, retain X row, if E,F,H are duplicates then retain F row – Chethan Mar 26 '21 at 14:39
  • In that case, I do not think this is a duplicate question. Reopened. But you should try to explain the question a bit more, as it is slightly confusing to read. – anky Mar 26 '21 at 14:40
  • duplicate is on c1,c2 and c3 – Chethan Mar 26 '21 at 14:42
  • @Chethan Are you defining the weightage based on the given `id` so lets say for id `A` you want to keep only the duplicate rows where class is `X`, similarly for id `B` you want to keep duplicated rows where the corresponding class is `F`? – Shubham Sharma Mar 26 '21 at 15:00
  • No, the weightage is based only on the class column: if x, y, z, then choose the x row when duplicate rows are present; if e, f, h, then choose the f row if a duplicate is present. – Chethan Mar 26 '21 at 15:03

1 Answer


Based on your explanation, you can list the classes that carry the weightage, flag the rows that are duplicated on id, c1, c2, c3, and then combine the two conditions:

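# classes to keep when rows are duplicated on id, c1, c2, c3
cls = ['X', 'F']
# True for every row that has at least one duplicate on the key columns
c = df.duplicated(['id', 'c1', 'c2', 'c3'], keep=False)
# keep duplicated rows only if their class is preferred; keep all non-duplicated rows
out = df[(c & df['class'].isin(cls)) | ~c]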

print(out)

  id class  c1  c2  c3
0  A     X  98  56  61
3  B     F  79  54  36
5  A     W  48  51  85
6  B     G  44  57  86
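
If the weightage has to cover more classes, or exactly one row per duplicate group must always survive, one possible variant is to rank the classes in a dictionary, sort by that rank, and let `drop_duplicates` keep the first row of each group. This is only a sketch under assumptions: the `rank` mapping, `default_rank`, and the helper column `_rank` are made up for illustration, and lower rank is assumed to mean higher weightage.

```python
import pandas as pd

df = pd.DataFrame(
    [["A", "X", 98, 56, 61], ["B", "E", 79, 54, 36], ["A", "Y", 98, 56, 61],
     ["B", "F", 79, 54, 36], ["A", "Z", 98, 56, 61], ["A", "W", 48, 51, 85],
     ["B", "G", 44, 57, 86], ["B", "H", 79, 54, 36]],
    columns=["id", "class", "c1", "c2", "c3"],
)

# Hypothetical weightage: lower rank wins; unlisted classes share default_rank
rank = {"X": 0, "F": 0}
default_rank = 1

key_cols = ["id", "c1", "c2", "c3"]

out = (
    df.assign(_rank=df["class"].map(rank).fillna(default_rank))
      .sort_values("_rank", kind="mergesort")   # stable sort keeps original order within a rank
      .drop_duplicates(key_cols, keep="first")  # best-ranked row of each duplicate group
      .sort_index()                             # restore the original row order
      .drop(columns="_rank")
)

print(out)
```

Unlike the boolean mask above, this keeps one row even when none of the duplicated rows has a preferred class (the first one in the original order), so it behaves differently in that case.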
anky
  • Here you have given the weightage based on the id column. For the weightage I am not using the id column; it is purely on the class column: if duplicate rows are present among x, y, z, choose the x row; if duplicate rows are present among e, f, h, choose the f row. – Chethan Mar 26 '21 at 15:08
  • @Chethan updated answer to accommodate that requirement. – anky Mar 26 '21 at 15:10