Using `drop_duplicates` on a Pandas dataframe isn't dropping rows

Question

Situation

I have dataframe similar to below ( although I've removed many of the rows for this example, as evidenced in the 'index' column):

df

index	id	name	last_updated
0	1518	Maker	2022-12-31T03:02:00.000Z
1	1518	Maker	2022-12-31T02:02:00.000Z
2	1518	Maker	2022-12-31T14:02:00.000Z
3	1518	Maker	2022-12-31T16:02:00.000Z
23	1518	Maker	2022-12-31T17:02:00.000Z
24	2280	Filecoin	2022-12-31T01:02:00.000Z
25	2280	Filecoin	2022-12-31T03:01:00.000Z
26	2280	Filecoin	2022-12-31T02:01:00.000Z
27	2280	Filecoin	2022-12-31T00:02:00.000Z
47	2280	Filecoin	2022-12-31T08:02:00.000Z
48	4558	Flow	2022-12-31T01:02:00.000Z
49	4558	Flow	2022-12-31T02:01:00.000Z
71	4558	Flow	2022-12-31T05:02:00.000Z
72	5026	Orchid	2022-12-31T01:02:00.000Z
73	5026	Orchid	2022-12-31T03:02:00.000Z
74	5026	Orchid	2022-12-31T02:01:00.000Z
75	5026	Orchid	2022-12-31T00:02:00.000Z

I want a version of the above dataframe but with only 1 row for each id parameter. Keeping the last instance.

This is my code:

df.drop_duplicates(subset=['id'], keep='last')

Expectation

That the new df would retain only 4 rows, the 'last' instance for each 'id' value in dataframe df.

Result

After running the drop_duplicates command, the df returns the exact same dataframe. Same shape as prior to my drop_duplicates attempt.

I've been trying to use this post to sort it out, but obvs there's something I'm not getting right:

pandas select rows with no duplicate

I'd appreciate any input on why the last instance of rows with duplicate 'id' values are not being dropped.

Freek de Kreek · Accepted Answer · 2023-01-09T15:14:42.540

2

You should add df.drop_duplicates(subset=['id'], keep='last', inplace=True). If you don't do this, only a copy is returned. By specifying inplace=True, the dataframe is modified.

See documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html

Hope this helps!

edited Jan 09 '23 at 15:14

answered Jan 09 '23 at 15:14

Freek de Kreek

44
3

1

Thank you both! That returned the result I wanted. I did get a notice along with the results: `A value is trying to be set on a copy of a slice from a DataFrame`. Is that simply because I'm reusing the same df name? If I change the df name (e.g., `df1 = df.drop_duplicates(subset=['id'], keep='last', inplace=True)`, then print `df1`, it returns `None`. – dsx Jan 09 '23 at 15:30
Yes I think it is caused because you use the same name for the df. Maybe this answer helps: https://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas. Glad it helped! – Freek de Kreek Jan 09 '23 at 15:34
Or you can simply leave the reassignment `df1 = df.drop_duplicates(subset=['id'], keep='last', inplace=True)` and do `df.drop_duplicates(subset=['id'], keep='last', inplace=True)` instead – Freek de Kreek Jan 09 '23 at 15:41
@dsx if any of the answers work for you - do not forget to mark it as accepted one (checkmark to the left of the answer) – Guru Stron Jan 09 '23 at 15:57
Thx! i think I'll keep it simple and go with `df.drop_duplicates(subset=['id'], keep='last', inplace=True)` – dsx Jan 09 '23 at 21:58

score 0 · Answer 2 · answered Jan 09 '23 at 15:18

Either reassign the variable:

str = """
index   id  name    last_updated
0   1518    Maker   2022-12-31T03:02:00.000Z
1   1518    Maker   2022-12-31T02:02:00.000Z
2   1518    Maker   2022-12-31T14:02:00.000Z
3   1518    Maker   2022-12-31T16:02:00.000Z
23  1518    Maker   2022-12-31T17:02:00.000Z
24  2280    Filecoin    2022-12-31T01:02:00.000Z
25  2280    Filecoin    2022-12-31T03:01:00.000Z
26  2280    Filecoin    2022-12-31T02:01:00.000Z
27  2280    Filecoin    2022-12-31T00:02:00.000Z
47  2280    Filecoin    2022-12-31T08:02:00.000Z
48  4558    Flow    2022-12-31T01:02:00.000Z
49  4558    Flow    2022-12-31T02:01:00.000Z
71  4558    Flow    2022-12-31T05:02:00.000Z
72  5026    Orchid  2022-12-31T01:02:00.000Z
73  5026    Orchid  2022-12-31T03:02:00.000Z
74  5026    Orchid  2022-12-31T02:01:00.000Z
75  5026    Orchid  2022-12-31T00:02:00.000Z
"""

csvStringIO = StringIO(str)
df = pd.read_csv(csvStringIO, sep="\t")
df = df.drop_duplicates(subset=['id'], keep='last')
print(df) # 4 rows

Or set inplace to True:

df.drop_duplicates(subset=['id'], keep='last', inplace=True)
print(df) # 4 rows

Using `drop_duplicates` on a Pandas dataframe isn't dropping rows

2 Answers2