2

I have a data frame of about 52,000 rows with some duplicates. When I use

df.drop_duplicates()

I lose about 1,000 rows, but I don't want to erase these rows; I want to know which rows are the duplicates.

Luis Ramon Ramirez Rodriguez
  • 9,591
  • 27
  • 102
  • 181
  • Does this answer your question? [How do I get a list of all the duplicate items using pandas in python?](https://stackoverflow.com/questions/14657241/how-do-i-get-a-list-of-all-the-duplicate-items-using-pandas-in-python) – Abu Shoeb Apr 27 '21 at 17:08

2 Answers

10

You could use duplicated for that:

df[df.duplicated()]

You can specify the keep argument for what you want (example after the list); from the docs:

keep : {‘first’, ‘last’, False}, default ‘first’

  • first : Mark duplicates as True except for the first occurrence.
  • last : Mark duplicates as True except for the last occurrence.
  • False : Mark all duplicates as True.
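
For example, here is a minimal sketch (the DataFrame and column names are invented for illustration) showing how keep changes which rows get flagged:

import pandas as pd

# hypothetical frame where the first and third rows are identical
df = pd.DataFrame({'name': ['a', 'b', 'a'], 'value': [1, 2, 1]})

df[df.duplicated()]              # keep='first' (default): only the second 'a' row
df[df.duplicated(keep=False)]    # every row that has a duplicate anywhere

Using keep=False is handy when you want to inspect all members of each duplicate group, not just the later repeats.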
Anton Protopopov
  • 30,354
  • 12
  • 88
  • 93
0

To identify duplicates within a pandas column without dropping them, try the following.

Let 'Column_A' be the column with duplicate entries and 'Column_B' be a True/False column that marks duplicates in Column_A:

df['Column_B'] = df.duplicated(subset='Column_A', keep='first')

Change the parameters to fine-tune it to your needs.
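
A minimal sketch of how this looks end to end (the DataFrame contents and column names are made up for illustration):

import pandas as pd

df = pd.DataFrame({'Column_A': ['x', 'y', 'x', 'z']})

# True for every repeated value after its first occurrence
df['Column_B'] = df.duplicated(subset='Column_A', keep='first')

# rows flagged as duplicates can then be inspected instead of dropped
print(df[df['Column_B']])

Passing keep=False instead would flag every row of a duplicate group, including the first occurrence.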

Arthur D. Howland
  • 4,363
  • 3
  • 21
  • 31