I'm familiar with how to drop duplicate rows using drop_duplicates and its keep parameter ('first', 'last', or False). Nothing too complicated with that, and there are plenty of examples (ie here).
However, what I'm looking for is a way to find the duplicates but, instead of dropping all duplicates and keeping the first, keep all the duplicates and drop only the first.
So instead of "if duplicates, drop all but the first", I want "if duplicates, keep all but the first".
Example:
Given this dataframe, looking at duplicates in the cost column:
ID name type cost
0 0 a bb 1
1 1 a cc 2 <--- there are duplicates, so drop this row
2 1_0 a dd 2
3 2 a ee 3 <--- there are duplicates, so drop this row
4 2_0 a ff 3
5 2_1 a gg 3
6 2_2 a hh 3
If there are duplicates in the cost column, drop just the first occurrence and keep the rest.
So my output would be:
ID name type cost
0 0 a bb 1
2 1_0 a dd 2
4 2_0 a ff 3
5 2_1 a gg 3
6 2_2 a hh 3
Here's the sample dataframe:
import pandas as pd

df = pd.DataFrame([
    ['0', 'a', 'bb', '1'],
    ['1', 'a', 'cc', '2'],
    ['1_0', 'a', 'dd', '2'],
    ['2', 'a', 'ee', '3'],
    ['2_0', 'a', 'ff', '3'],
    ['2_1', 'a', 'gg', '3'],
    ['2_2', 'a', 'hh', '3']], columns=['ID', 'name', 'type', 'cost'])
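For what it's worth, here is a sketch of the kind of thing I'm after, built from two duplicated masks (my own attempt, so I may be missing a more idiomatic option):

```python
import pandas as pd

df = pd.DataFrame([
    ['0', 'a', 'bb', '1'],
    ['1', 'a', 'cc', '2'],
    ['1_0', 'a', 'dd', '2'],
    ['2', 'a', 'ee', '3'],
    ['2_0', 'a', 'ff', '3'],
    ['2_1', 'a', 'gg', '3'],
    ['2_2', 'a', 'hh', '3']], columns=['ID', 'name', 'type', 'cost'])

# duplicated() marks every occurrence after the first as True,
# so on its own it selects "all duplicates except the first"
is_later_dup = df.duplicated('cost')

# duplicated(keep=False) marks every member of a duplicate group,
# so its negation selects values that appear only once
is_unique = ~df.duplicated('cost', keep=False)

result = df[is_later_dup | is_unique]
print(result)
```

On the sample data this keeps IDs 0, 1_0, 2_0, 2_1 and 2_2, matching the output above.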