0

I was learning few functions and came across these:

df.groupby("date").agg("user_id": pd.Series.nunique)

and

df.groupby("date")["user_id"].transform('count')

I would like to know and learn on the differences between the above two statements.

Thanks!

omdurg
  • 330
  • 3
  • 13

1 Answers1

2

Check:

df = pd.DataFrame({
        'date':['2019-01-01', '2019-01-01', '2019-01-01', 
                '2019-01-02', '2019-01-02', '2019-01-02'],
         'user_id':[np.nan,4,4,5,5,4],

})

print (df)
         date  user_id
0  2019-01-01      NaN
1  2019-01-01      4.0
2  2019-01-01      4.0
3  2019-01-02      5.0
4  2019-01-02      5.0
5  2019-01-02      4.0

First difference is here is used aggregation, reduce rows by some aggregate function. Here is used Series.nunique for count number of unique values, missing valeus are omitted, so for first group get 1, for second 2:

print (df.groupby("date").agg({"user_id": pd.Series.nunique}))
            user_id
date               
2019-01-01      1.0
2019-01-02      2.0

If is used DataFrameGroupBy.nunique also is count missinf values like unique value:

print (df.groupby("date")["user_id"].nunique())
date
2019-01-01    2
2019-01-02    2
Name: user_id, dtype: int64

Looks like bug, because:

print (df.groupby("date")["user_id"].nunique(dropna=True))
date
2019-01-01    2
2019-01-02    2
Name: user_id, dtype: int64

print (df.groupby("date")["user_id"].nunique(dropna=False))
date
2019-01-01    2
2019-01-02    2
Name: user_id, dtype: int64

If use GroupBy.transform then is applied aggregate function, but rows are not reduced, but same values are repeated for same group, so get new Series with same length like original.

Also GroupBy.count id used for count number of values with exclude missing values, so for first group get 2 and for second, because no missing values, 3:

print (df.groupby("date")["user_id"].transform('count'))
0    2
1    2
2    2
3    3
4    3
5    3
Name: user_id, dtype: int64
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Great explanation @jezrael, how could I use `distinct count` with `print (df.groupby("date")["user_id"].transform('count'))` – omdurg May 14 '20 at 12:05
  • @omdurg - Do you think `print (df.groupby("date")["user_id"].transform('nunique'))` ? – jezrael May 14 '20 at 12:06