The example below counts duplicate rows by columns x1 and x2. The output has x1, x2 and the count. I expect it to keep x3 (from the first row of each group of duplicates) as well.

import re
import pandas as pd

data = [
    ['A','B','C'],
    ['A','B','D'],
    ['A','D','C'],
    ['A','D','C']
]

df = pd.DataFrame(data,columns=['x1','x2','x3'])
print(df)

df1 = df.groupby(['x1','x2']).size().reset_index()
print(df1)

current output:

  x1 x2 x3
0  A  B  C
1  A  B  D
2  A  D  C
3  A  D  C

  x1 x2  0
0  A  B  2
1  A  D  2

expected output:

  x1 x2 x3  0
0  A  B  C  2
1  A  D  C  2
Corralien
lucky1928

1 Answer

You can use `groupby` followed by `agg` to keep x3 in the output:

>>> (df.groupby(['x1','x2'])
        .agg(x3=('x3', 'first'), cnt=('x1', 'size'))
        .reset_index())

  x1 x2 x3  cnt
0  A  B  C    2
1  A  D  C    2
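
For reference, here is a self-contained sketch of the same approach that can be run as a script. It rebuilds the question's DataFrame and spells out the `new_column=(column, function)` pattern. One assumption differs from the snippet above: the count is taken on `x3` rather than on the grouping key `x1`; since `size` counts rows per group, the result should be the same.

```python
import pandas as pd

# Same data as in the question.
df = pd.DataFrame(
    [["A", "B", "C"], ["A", "B", "D"], ["A", "D", "C"], ["A", "D", "C"]],
    columns=["x1", "x2", "x3"],
)

# Named aggregation: each keyword is new_column=(source_column, function).
result = (
    df.groupby(["x1", "x2"])
    .agg(
        x3=("x3", "first"),  # keep x3 from the first row of each group
        cnt=("x3", "size"),  # number of rows in each group
    )
    .reset_index()
)
print(result)
```

Note that `first` keeps the value from the first row of each group, so for the (A, B) group the `D` in the second row is dropped.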
Corralien
  • Great, what's the reason for enclosing everything in "()"? – lucky1928 Jun 28 '23 at 04:42
  • This is a [named aggregation](https://pandas.pydata.org/docs/user_guide/groupby.html#groupby-aggregate-named). `agg` expects keyword arguments with the following format: `new_column=(column, function)` => the `function` is applied on `column` and the result is stored in `new_column`. – Corralien Jun 28 '23 at 04:51
  • I'm sorry, I misunderstood the question. I enclosed the code in `( )` so that a single statement can span multiple lines. It avoids a "\" at the end of each line (and it also avoids the horizontal scrollbar). More information [here](https://stackoverflow.com/q/53162/15239951) – Corralien Jun 28 '23 at 04:54
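
To illustrate the point about parentheses in the comments above, here is a small sketch comparing the two ways of splitting the chained call over several lines. The variable names `with_backslashes` and `with_parentheses` are made up for this example, and the count is taken on `x3` as an assumption; both statements should build the same DataFrame.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "x1": ["A", "A", "A", "A"],
        "x2": ["B", "B", "D", "D"],
        "x3": ["C", "D", "C", "C"],
    }
)

# Explicit continuation: every line except the last must end with a backslash.
with_backslashes = df.groupby(["x1", "x2"]) \
    .agg(x3=("x3", "first"), cnt=("x3", "size")) \
    .reset_index()

# Implicit continuation: inside parentheses, line breaks are ignored,
# so no backslashes are needed.
with_parentheses = (
    df.groupby(["x1", "x2"])
    .agg(x3=("x3", "first"), cnt=("x3", "size"))
    .reset_index()
)

# Both spellings are the same statement, so the frames compare equal.
print(with_backslashes.equals(with_parentheses))
```

The parenthesized form is usually easier to edit: a line can be added, removed or commented out without touching its neighbours, whereas nothing (not even a comment) may follow a trailing backslash.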