How to drop all duplicate rows, keeping the first, in Python pandas

Question

Hi, I want to remove all duplicate rows from a pandas DataFrame, keeping only the first occurrence of each. This is what I am doing:

import pandas as pd
df = pd.DataFrame({'col1':['A']*3+['B']*4+['C','B','A'],'col2':[2,3,4,2,4,2,1,3,4,4]})
print(df)
df.drop_duplicates(subset=None, keep='first', inplace=True, ignore_index=True)

This works, but it exceeds the time limit on my system. Can someone provide a faster solution?

Have you tried this: https://stackoverflow.com/questions/54196959/is-there-any-faster-alternative-to-col-drop-duplicates — Minh-Long Luu, Sep 29 '21 at 03:40

score 0 · Answer 1 · answered Sep 29 '21 at 03:49


I expect it to be much quicker with NumPy:

>>> import numpy as np
>>> pd.DataFrame(np.unique(df.to_numpy(dtype=str), axis=0), columns=df.columns)
  col1 col2
0    A    2
1    A    3
2    A    4
3    B    1
4    B    2
5    B    4
6    C    3
>>>

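This uses np.unique with axis=0 to find the unique rows. Be aware that it returns the rows in sorted order rather than first-occurrence order, and that dtype=str turns the col2 values into strings.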
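If the original row order and dtypes matter, one possible variation is to ask np.unique for the first-occurrence indices and select those rows from the original frame. This is only a minimal sketch reusing the question's df; the names first_idx and deduped are illustrative, and I have not benchmarked it against drop_duplicates, so measure it on your own data.

```python
import numpy as np
import pandas as pd

# Same example frame as in the question.
df = pd.DataFrame({
    'col1': ['A'] * 3 + ['B'] * 4 + ['C', 'B', 'A'],
    'col2': [2, 3, 4, 2, 4, 2, 1, 3, 4, 4],
})

# np.unique sorts its output; return_index=True also returns the position
# of the first occurrence of each unique row in the original array.
_, first_idx = np.unique(df.to_numpy(dtype=str), axis=0, return_index=True)

# Sorting those positions restores the original row order. Selecting from
# df with iloc keeps the original dtypes (col2 stays an integer column).
deduped = df.iloc[np.sort(first_idx)].reset_index(drop=True)
print(deduped)
```

Because rows are compared after casting to str, values such as 1 and '1' would be treated as equal, so this assumes that distinction does not matter in your data.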


U13-Forward

