Check:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'date': ['2019-01-01', '2019-01-01', '2019-01-01',
             '2019-01-02', '2019-01-02', '2019-01-02'],
    'user_id': [np.nan, 4, 4, 5, 5, 4],
})
print(df)
date user_id
0 2019-01-01 NaN
1 2019-01-01 4.0
2 2019-01-01 4.0
3 2019-01-02 5.0
4 2019-01-02 5.0
5 2019-01-02 4.0
The first difference is that aggregation is used here: the rows are reduced by some aggregate function. Series.nunique counts the number of unique values; missing values are omitted, so the first group gets 1 and the second 2:
print(df.groupby("date").agg({"user_id": pd.Series.nunique}))
user_id
date
2019-01-01 1.0
2019-01-02 2.0
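For completeness, Series.nunique also accepts dropna=False, so the NaN can be counted as its own distinct value inside agg as well; a minimal sketch (the lambda wrapper is only illustrative, not from the original):
print(df.groupby("date")["user_id"].agg(lambda s: s.nunique(dropna=False)))
date
2019-01-01    2
2019-01-02    2
Name: user_id, dtype: int64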
If SeriesGroupBy.nunique is used instead, the missing value is also counted as a unique value:
print(df.groupby("date")["user_id"].nunique())
date
2019-01-01 2
2019-01-02 2
Name: user_id, dtype: int64
This looks like a bug, because dropna makes no difference:
print(df.groupby("date")["user_id"].nunique(dropna=True))
date
2019-01-01 2
2019-01-02 2
Name: user_id, dtype: int64
print(df.groupby("date")["user_id"].nunique(dropna=False))
date
2019-01-01 2
2019-01-02 2
Name: user_id, dtype: int64
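Until this is fixed, a possible workaround is to route the call through agg, which invokes Series.nunique on each group directly and respects dropna, as the agg example above shows; a hedged sketch:
# Workaround sketch: agg applies Series.nunique per group,
# so the NaN is dropped and the first group yields 1, the second 2.
print(df.groupby("date")["user_id"].agg(pd.Series.nunique))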
If GroupBy.transform is used, the aggregate function is still applied, but the rows are not reduced; instead the aggregated value is repeated for every row of its group, so you get a new Series with the same length as the original. Here GroupBy.count is used, which counts the values while excluding missing ones, so the first group gets 2 and the second, which has no missing values, gets 3:
print(df.groupby("date")["user_id"].transform('count'))
0 2
1 2
2 2
3 3
4 3
5 3
Name: user_id, dtype: int64
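Such a transform result is typically used for row-level filtering against the original DataFrame, since it aligns with the original index; a small usage sketch (keeping only dates with more than two non-missing user_id values, the threshold 2 is illustrative):
print(df[df.groupby("date")["user_id"].transform('count') > 2])
         date  user_id
3  2019-01-02      5.0
4  2019-01-02      5.0
5  2019-01-02      4.0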