0

Good morning guys. So my problem is to remove duplicates from dataframe caused by many diffrent values in one of columns.

The base dataframe looks like this below:

enter image description here

As You can see, I have duplicated values in columns Name and Id depends on Category. Our goal is to remove those duplicates while keeping the information about category.

I would like to have the exact view as here below:

enter image description here

I have tried to use get_dummies method from pandas library but i have some issues.

dummies = pd.get_dummies(df[['Category']], drop_first=True)
df = pd.concat([df.drop(['Category'], axis=1), dummies], axis=1)

Using the code above i'm getting the result like this below:

enter image description here

The result is basicly still the same as base dataframe.

Do You guys have any idea how to deal with it?

  • please provide the data as **text**, not images: [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – mozway Jan 27 '22 at 09:15

1 Answers1

1

It depends what need - if possible duplicates per Name and Id is necessary aggregate max:

df = (pd.get_dummies(df, columns=['Category'])
        .groupby(['Name','Id'], as_index=False)
         .max())
print (df)
  Name  Id  Category_A  Category_B  Category_C
0  ABC   1           1           0           0
1  ABC   2           0           1           0
2  DEF   2           1           0           0
3  GHI   3           0           0           1
4  JKL   4           0           0           1
5  MNO   5           1           0           0

If need aggregate per Id with last value for non numeric values use:

f = lambda x: x.max() if np.issubdtype(x.dtype, np.number) else x.iat[-1]
df = (pd.get_dummies(df, columns=['Category'])
        .groupby('Id', as_index=False)
         .agg(f))
print (df)
   Id Name  Category_A  Category_B  Category_C
0   1  ABC           1           0           0
1   2  DEF           1           1           0
2   3  GHI           0           0           1
3   4  JKL           0           0           1
4   5  MNO           1           0           0

In second solution is possible specify columns for aggregations:

# f = lambda x: x.max() if np.issubdtype(x.dtype, np.number) else x.iat[-1]
df = pd.get_dummies(df, columns=['Category'])
         
d = dict.fromkeys(df.columns, 'max')
d['Name'] = 'last'
print (d)
{'Name': 'last', 'Id': 'max', 'Category_A': 'max', 'Category_B': 'max', 'Category_C': 'max'}

df = df.groupby('Id', as_index=False).agg(d)
print (df)
  Name  Id  Category_A  Category_B  Category_C
0  ABC   1           1           0           0
1  DEF   2           1           1           0
2  GHI   3           0           0           1
3  JKL   4           0           0           1
4  MNO   5           1           0           0
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252