1

I have DataFrame with multiple columns and few columns contains list values. By considering only columns with list values in it, duplicate rows have to be deleted.

Current Dataframe:

ID    col1            col2            col3            col4
1     52         [kjd,pkh,sws]    [aqs,zxc,asd]   [plm,okn,ijb]
2     47         [qaz,wsx,edc]    [aws,rfc,tgb]   [rty,wer,dfg]
3     85         [kjd,pkh,sws]    [aqs,zxc,asd]   [plm,okn,ijb]
4     27         [asw,bxs,mdh]    [wka,kdy,kaw]   [pqm,lsc,yhb]

Desired output:

ID    col1            col2            col3            col4
2     47         [qaz,wsx,edc]    [aws,rfc,tgb]   [rty,wer,dfg]
4     27         [asw,bxs,mdh]    [wka,kdy,kaw]   [pqm,lsc,yhb]

I have tried converting it to tuple and apply df.drop_duplicates() but am getting multiple errors

KPH
  • 64
  • 8
  • No, I have tried this out. Since I have List in column values it produces error – KPH Jan 25 '21 at 04:38
  • `drop_duplicates` works with the example you gave for me. If you are still having trouble, just edit your post with more details about the issue you are having with `drop_duplicates`. – busybear Jan 25 '21 at 05:04
  • I think the post mentioned as the answer is not the true answer for this question and this post is the answer instead: https://stackoverflow.com/a/52949875/7206701 – Hoori M. Jan 25 '21 at 05:05
  • Does the elements' order in the duplicated list will change? – Ferris Jan 25 '21 at 05:05
  • @HooriM. This answer converts single column to tuple but my question has three columns with list values – KPH Jan 25 '21 at 05:09
  • @karthikph, see my answer. I think you will need to convert the lists into astype(str) and then drop duplicates – Joe Ferndz Jan 25 '21 at 05:14

1 Answers1

1

You can convert each of the columns with lists into str and then drop duplicates.

  • Step 1: Convert each column that has lists into a string type using astype(str).
  • Step 2: use drop_duplicates with the columns as strings. Since you want all duplicates to be removed, set keep=False.
  • Step 3: drop the temp created astype(str) columns as you no longer need them.

The full code will be:

c = ['col1','col2','col3','col4']
d =[[52,['kjd','pkh','sws'],['aqs','zxc','asd'],['plm','okn','ijb']],
    [47,['qaz','wsx','edc'],['aws','rfc','tgb'],['rty','wer','dfg']],
    [85,['kjd','pkh','sws'],['aqs','zxc','asd'],['plm','okn','ijb']],
    [27,['asw','bxs','mdh'],['wka','kdy','kaw'],['pqm','lsc','yhb']]]

import pandas as pd
df = pd.DataFrame(d,columns=c)
print(df)

df['col2s'] = df['col2'].astype(str)
df['col3s'] = df['col3'].astype(str)
df['col4s'] = df['col4'].astype(str)

df.drop_duplicates(subset=['col2s', 'col3s','col4s'],keep=False,inplace=True)
df.drop(['col2s', 'col3s','col4s'],axis=1,inplace=True)
print (df)

The output of this will be:

Original DataFrame:

   col1             col2             col3             col4
0    52  [kjd, pkh, sws]  [aqs, zxc, asd]  [plm, okn, ijb]
1    47  [qaz, wsx, edc]  [aws, rfc, tgb]  [rty, wer, dfg]
2    85  [kjd, pkh, sws]  [aqs, zxc, asd]  [plm, okn, ijb]
3    27  [asw, bxs, mdh]  [wka, kdy, kaw]  [pqm, lsc, yhb]

DataFrame after dropping the duplicates:

   col1             col2             col3             col4
1    47  [qaz, wsx, edc]  [aws, rfc, tgb]  [rty, wer, dfg]
3    27  [asw, bxs, mdh]  [wka, kdy, kaw]  [pqm, lsc, yhb]
ah bon
  • 9,293
  • 12
  • 65
  • 148
Joe Ferndz
  • 8,417
  • 2
  • 13
  • 33