
I have a CSV file that contains duplicate values within each row. I would like to remove these duplicates so that each row is left with only its unique values.

Dataframe:

 1                            2          3                   4           5                              6    
Bypass User Account Control  T3431      Elevated Execution   T3424      Bypass User Account Control    T3431
Local Account                T3523      Domain Account       T4252      Local Account                  T3523

Expected Dataframe:

  1                            2          3                   4           5                              6    
Bypass User Account Control  T3431      Elevated Execution   T3424      
Local Account                T3523      Domain Account       T4252                         

There are hundreds of duplicate values in the rows, and I would only like to see the unique values.

Will

2 Answers


Reduce each row to its unique values with unique(). The output is an array, so convert it to a Series:

df1 = df.apply(lambda x: pd.Series(x.unique()), axis=1)
print(df1)
                             0      1                   2      3
0  Bypass User Account Control  T3431  Elevated Execution  T3424
1                Local Account  T3523      Domain Account  T4252

Or:

df1 = df.apply(lambda x: x.drop_duplicates().reset_index(drop=True), axis=1)
print(df1)
                             0      1                   2      3
0  Bypass User Account Control  T3431  Elevated Execution  T3424
1                Local Account  T3523      Domain Account  T4252
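
To try both approaches without the original CSV, here is a minimal, self-contained sketch. It assumes the sample rows from the question with integer column labels 1 to 6; the names via_unique and via_drop are only illustrative.

```python
import pandas as pd

# Sample data rebuilt from the question (assumed integer headers 1..6).
df = pd.DataFrame(
    [
        ["Bypass User Account Control", "T3431", "Elevated Execution", "T3424",
         "Bypass User Account Control", "T3431"],
        ["Local Account", "T3523", "Domain Account", "T4252",
         "Local Account", "T3523"],
    ],
    columns=[1, 2, 3, 4, 5, 6],
)

# Approach 1: unique values per row, wrapped in a Series so apply builds a DataFrame.
via_unique = df.apply(lambda x: pd.Series(x.unique()), axis=1)

# Approach 2: drop duplicates per row, then renumber the positions from 0.
via_drop = df.apply(lambda x: x.drop_duplicates().reset_index(drop=True), axis=1)

print(via_unique)
print(via_drop)
```

Both keep the first occurrence of each value in its original left-to-right order, so on this sample they should produce the same four columns.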

Finally, to restore the original column names, use:

df1.columns = df.columns[:len(df1.columns)]
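
The expected output in the question still shows all six headers, with the last two columns empty. As a sketch of how that could be done (assuming the same sample data as in the question; trimmed and padded are illustrative names), the result can either be given the first few original headers, or be widened back to the original width with reindex before renaming:

```python
import pandas as pd

# Sample data rebuilt from the question (assumed integer headers 1..6).
df = pd.DataFrame(
    [
        ["Bypass User Account Control", "T3431", "Elevated Execution", "T3424",
         "Bypass User Account Control", "T3431"],
        ["Local Account", "T3523", "Domain Account", "T4252",
         "Local Account", "T3523"],
    ],
    columns=[1, 2, 3, 4, 5, 6],
)

df1 = df.apply(lambda x: pd.Series(x.unique()), axis=1)

# Option A: keep only as many original headers as there are result columns.
trimmed = df1.copy()
trimmed.columns = df.columns[:len(trimmed.columns)]

# Option B: pad back to the original width (missing positions become NaN),
# then reuse every original header.
padded = df1.reindex(columns=range(len(df.columns)))
padded.columns = df.columns

print(trimmed)
print(padded)
```

With Option B the empty cells are NaN, which to_csv writes as empty fields by default.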
jezrael
  • Amazing, thank you very much. Is there a way I can keep my original headers in the file? – Will Feb 03 '21 at 11:14
  • @Will - The number of unique values can differ for each row's Series, so you can add `df1.columns = df.columns[:len(df1.columns)]` if `df1` is the output `DataFrame` – jezrael Feb 03 '21 at 11:16

Use stack, groupby and unstack:

(df.stack()
  .groupby(level=0).apply(lambda x: x.drop_duplicates())
  .unstack()
  .reset_index(drop=True))

result:

                             1      2                   3      4
0  Bypass User Account Control  T3431  Elevated Execution  T3424
1                Local Account  T3523      Domain Account  T4252
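
One difference worth checking on your own data: the stack/unstack version leaves each surviving value under its original column label, while the unique-based version packs values to the left. On the sample above the two agree, because the duplicates sit in the same trailing columns of every row. The sketch below uses a small made-up frame (not from the question) where the duplicates fall in different positions per row.

```python
import pandas as pd

# Hypothetical data: the duplicate is in column 2 for row 0 and column 3 for row 1.
df = pd.DataFrame([["a", "a", "b"], ["c", "d", "c"]], columns=[1, 2, 3])

# stack/unstack: survivors stay under their original column labels,
# so dropped positions show up as NaN gaps.
in_place = (df.stack()
              .groupby(level=0).apply(lambda x: x.drop_duplicates())
              .unstack()
              .reset_index(drop=True))

# unique-based: survivors are packed to the left, with no gaps here.
packed = df.apply(lambda x: pd.Series(x.unique()), axis=1)

print(in_place)
print(packed)
```

Which behaviour is preferable depends on whether the column position of a value carries meaning in your file.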
Ferris
wwnde