Hi, I want to remove all duplicate rows from a pandas DataFrame, keeping only the first occurrence of each. This is what I am doing:

import pandas as pd
df = pd.DataFrame({'col1':['A']*3+['B']*4+['C','B','A'],'col2':[2,3,4,2,4,2,1,3,4,4]})
print(df)
df.drop_duplicates(subset=None, keep='first', inplace=True, ignore_index=True)

This works, but it exceeds the time limit on my system. Can someone suggest a faster solution?

Sam
  • Have you tried this: https://stackoverflow.com/questions/54196959/is-there-any-faster-alternative-to-col-drop-duplicates – Minh-Long Luu Sep 29 '21 at 03:40

1 Answer

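I expect this to be much quicker with NumPy. Note that it casts every value to a string and returns the unique rows in sorted order rather than in order of first appearance: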

>>> import numpy as np
>>> pd.DataFrame(np.unique(df.to_numpy(dtype=str), axis=0), columns=df.columns)
  col1 col2
0    A    2
1    A    3
2    A    4
3    B    1
4    B    2
5    B    4
6    C    3

This uses np.unique with axis=0, which treats each row as a single item when removing duplicates.
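If the "keep first" order and the original dtypes matter, one possible variant is to ask np.unique for the positions of the first occurrences (return_index=True) and use them to select rows from the original frame. This is only a sketch: the names first_idx and deduped are made up for illustration, the string cast means values such as 1 and '1' in the same column would be treated as equal, and whether it actually beats drop_duplicates depends on the data and should be timed.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': ['A'] * 3 + ['B'] * 4 + ['C', 'B', 'A'],
    'col2': [2, 3, 4, 2, 4, 2, 1, 3, 4, 4],
})

# np.unique sorts the rows; return_index gives, for each unique row,
# the position of its first occurrence in the original array.
_, first_idx = np.unique(df.to_numpy(dtype=str), axis=0, return_index=True)

# Sorting those positions restores the original row order, and selecting
# with iloc keeps the original dtypes (col2 stays an integer column).
deduped = df.iloc[np.sort(first_idx)].reset_index(drop=True)
print(deduped)
```

On this sample frame the result should match df.drop_duplicates(keep='first', ignore_index=True).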

U13-Forward