I have a dataframe:

import pandas as pd

data = pd.DataFrame({"col1": ["a", "a", "a", "a", "a", "a"],
                     "col2": [0,0,0,1,1, 1],
                     "col3": [1,2,3,4,5, 6]})

data


  col1  col2    col3
0   a   0   1
1   a   0   2
2   a   0   3
3   a   1   4
4   a   1   5
5   a   1   6

I'm trying to remove the duplicates among the rows where col2 == 1 and keep only the last entry.

Using the code below, I was able to keep the first entry and drop the others.

data[~(data.duplicated(["col2"]) & data.col2.eq(1))]

  col1  col2    col3
0   a   0   1
1   a   0   2
2   a   0   3
3   a   1   4

How can I remove duplicates within one category of a column and keep the last entry?

Desired Output

  col1  col2    col3
0   a   0   1
1   a   0   2
2   a   0   3
3   a   1   6
Ailurophile

2 Answers

Use boolean indexing with the help of groupby.cumcount (or duplicated):

# is the row not a 1 in col2?
m1 = data['col2'].ne(1)
# is the row the last of the group?
m2 = data.groupby('col2').cumcount(ascending=False).eq(0)
# or
# m2 = ~data['col2'].duplicated(keep='last')

# keep rows matching either condition
out = data[m1|m2]

Reversed logic:

# is the group a 1?
m1 = data['col2'].eq(1)
# is the row NOT the last one?
m2 = data['col2'].duplicated(keep='last')

# drop the rows matching both conditions
out = data[~(m1&m2)]

Output:

  col1  col2  col3
0    a     0     1
1    a     0     2
2    a     0     3
5    a     1     6
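
As a self-contained sketch, the reversed logic can be wrapped in a small function. The name keep_last_for_value and its parameters are hypothetical, introduced only for illustration; the body uses the same eq and duplicated(keep='last') calls as above.

```python
import pandas as pd

data = pd.DataFrame({"col1": ["a", "a", "a", "a", "a", "a"],
                     "col2": [0, 0, 0, 1, 1, 1],
                     "col3": [1, 2, 3, 4, 5, 6]})


def keep_last_for_value(df, column, value):
    """Hypothetical helper: among rows where df[column] == value, keep only the last."""
    # rows that belong to the category being de-duplicated
    is_value = df[column].eq(value)
    # rows that have a later row with the same value in `column`
    not_last = df[column].duplicated(keep="last")
    # drop only the rows that are in the category AND are not its last row
    return df[~(is_value & not_last)]


out = keep_last_for_value(data, "col2", 1)
print(out)
```

The original index labels are preserved, so the surviving row keeps label 5. Chaining .reset_index(drop=True) would renumber the rows if a fresh index is wanted.
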
mozway

You can also split off the rows where col2 == 1, take the last of them, and concat it back onto the rows that are kept.

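# rows belonging to the category to de-duplicate
m = data['col2'].eq(1)
# untouched rows plus the last row of the category, restored to original order
out = pd.concat([data[~m], data[m].iloc[[-1]]]).sort_index()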
print(out)

  col1  col2  col3
0    a     0     1
1    a     0     2
2    a     0     3
5    a     1     6
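
A runnable version of this approach on the question's data might look as follows. Two choices here are assumptions rather than part of the answer: tail(1) is swapped in for iloc[[-1]] because it returns an empty frame instead of raising when no row has col2 == 1, and reset_index(drop=True) is added to renumber the rows 0 to 3 as in the question's desired output.

```python
import pandas as pd

data = pd.DataFrame({"col1": ["a", "a", "a", "a", "a", "a"],
                     "col2": [0, 0, 0, 1, 1, 1],
                     "col3": [1, 2, 3, 4, 5, 6]})

# rows belonging to the category to de-duplicate
m = data["col2"].eq(1)

# keep every other row, plus only the last row of the category
out = pd.concat([data[~m], data[m].tail(1)]).sort_index()

# assumption: renumber the index so it runs 0..n-1
out = out.reset_index(drop=True)
print(out)
```
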
Ynjxsjmh