Drop duplicated rows and add a column with the repeat count

Question

I would like to add a repeat count for the duplicate rows. The current example only drops the duplicate rows.

import pandas as pd

data = [
    ['A','B','C'],
    ['A','B','C'],
    ['A','D','C'],
    ['A','D','C']
]

df = pd.DataFrame(data,columns=['x1','x2','x3'])
print(df)

df1 = df.drop_duplicates(keep='first')
print(df1)

Expected output:

  x1 x2 x3
0  A  B  C
1  A  B  C
2  A  D  C
3  A  D  C
  x1 x2 x3  count
0  A  B  C      2
2  A  D  C      2

Most efficient option: `df.value_counts().reset_index(name='count')` — mozway, Jun 28 '23 at 08:24
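For reference, a minimal sketch of the one-liner from the comment above, applied to the question's data. It assumes a pandas version that provides `DataFrame.value_counts`. The name `out` is only illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [['A', 'B', 'C'], ['A', 'B', 'C'], ['A', 'D', 'C'], ['A', 'D', 'C']],
    columns=['x1', 'x2', 'x3'],
)

# value_counts() counts each distinct row; reset_index() turns the
# resulting MultiIndex back into columns and names the counts column.
out = df.value_counts().reset_index(name='count')
print(out)
```

The result has one row per distinct combination and a fresh 0..n-1 index, so the original row labels are not preserved.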

score 1 · Accepted Answer · answered Jun 28 '23 at 02:25

Try this line:

df1 = df.groupby(['x1', 'x2', 'x3']).size().reset_index(name='count')

Here, we first group by x1, x2, and x3, so each distinct combination appears only once, and then size() counts how many rows fall into each group.
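A self-contained sketch of this approach on the question's data. The second part is an assumption about what the asker may want: the expected output keeps the original row labels (0 and 2), so it shows one possible way to preserve them using `transform('size')` followed by `drop_duplicates`. The names `cols`, `counts` and `with_labels` are illustrative only.

```python
import pandas as pd

data = [
    ['A', 'B', 'C'],
    ['A', 'B', 'C'],
    ['A', 'D', 'C'],
    ['A', 'D', 'C'],
]
df = pd.DataFrame(data, columns=['x1', 'x2', 'x3'])
cols = ['x1', 'x2', 'x3']

# One row per distinct (x1, x2, x3) combination with its number of occurrences.
# The index is rebuilt as 0..n-1.
counts = df.groupby(cols).size().reset_index(name='count')
print(counts)

# Variant that keeps the original row labels: attach the group size to
# every row first, then drop the repeated rows.
with_labels = (
    df.assign(count=df.groupby(cols)['x1'].transform('size'))
      .drop_duplicates(subset=cols, keep='first')
)
print(with_labels)
```

Note that groupby drops rows with missing values in the key columns by default; pass `dropna=False` if such rows should be counted as well.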

Ehsan Hamzei

score 1 · Answer 2 · answered Jun 28 '23 at 02:29

You can use the pandas method duplicated to flag the repeated rows, then groupby and count them, add 1 (to account for the first appearance), and finally apply some formatting to match your expected output:

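import pandas as pd

data = [
    ['A','B','C'],
    ['A','B','C'],
    ['A','D','C'],
    ['A','D','C']
]

df = pd.DataFrame(data,columns=['x1','x2','x3'])
print(df)

# Flag every row that repeats an earlier row (the first occurrence stays False)
df['dup'] = df.duplicated()
# Keep only the repeats
df = df[df.dup]
# Count the repeats per combination and add 1 for the first occurrence
df = (df.drop(columns=['dup']).groupby(['x1', 'x2', 'x3']).value_counts() + 1).reset_index(name='count')

print(df)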

Output:

  x1 x2 x3
0  A  B  C
1  A  B  C
2  A  D  C
3  A  D  C
  x1 x2 x3  count
0  A  B  C      2
1  A  D  C      2
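One caveat with the duplicated-based approach: because it filters on the duplicated flag, a row that occurs only once is dropped entirely instead of being reported with a count of 1. The sketch below illustrates this with a hypothetical extra row ('A', 'E', 'C') that is not in the question's data. It uses size() for the counting step, and the variable names are illustrative.

```python
import pandas as pd

# Hypothetical data: ('A', 'B', 'C') occurs twice, ('A', 'E', 'C') only once.
df = pd.DataFrame(
    [['A', 'B', 'C'], ['A', 'B', 'C'], ['A', 'E', 'C']],
    columns=['x1', 'x2', 'x3'],
)
cols = ['x1', 'x2', 'x3']

# duplicated() is False for the first occurrence of every row, so this
# filter keeps only the repeats; the unique row disappears here.
repeats = df[df.duplicated()]
dup_based = (repeats.groupby(cols).size() + 1).reset_index(name='count')
print(dup_based)

# Counting on the full frame keeps the unique row with count 1.
full = df.groupby(cols).size().reset_index(name='count')
print(full)
```

If rows that occur once should appear in the result, counting on the full frame is the simpler choice.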
