I have the following data:
Out[6]:
                 Src  Dst  Port  Application       Start        Date
0            0.0.0.0    1     1            1  2016-10-20  2016-10-20
1  00:00:0C:9F:F0:64   10     1            1  2016-10-20  2016-10-20
2  00:00:0C:9F:F0:65    3     1            1  2016-10-20  2016-10-20
3  00:00:0C:9F:F0:66   10     1            1  2016-10-20  2016-10-20
4  00:00:0C:9F:F0:67   42     1            1  2016-10-20  2016-10-20
In [7]: df.apply(lambda x: x.nunique())
Out[7]:
Src            791215
Dst              2599
Port                1
Application        44
Start             335
Date               15
dtype: int64
I want to know the number of unique values in each column for each source on each day.
I wrote:
df_day = df.groupby(['Src', 'Date'], as_index=False).apply(lambda x: x.apply(lambda x: x.nunique()))
but it is incredibly slow (it runs forever). The number of groups is quite large (up to 791215 * 15).
Is there any way I can speed up this computation?
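One direction that might help, offered as a rough sketch rather than a definitive fix: the nested `apply` calls run a Python lambda once per group, so replacing them with the built-in groupby `nunique` should avoid most of that overhead. The sample frame below is made up for illustration (only the column names come from the data above), and `value_cols` is a hypothetical helper name. It assumes a reasonably recent pandas.

```python
import pandas as pd

# Hypothetical sample data; only the column names match the real frame.
df = pd.DataFrame({
    "Src": ["0.0.0.0", "0.0.0.0", "00:00:0C:9F:F0:64", "00:00:0C:9F:F0:64"],
    "Dst": [1, 2, 10, 10],
    "Port": [1, 1, 1, 1],
    "Application": [1, 1, 1, 2],
    "Start": pd.to_datetime(["2016-10-20"] * 4),
    "Date": pd.to_datetime(
        ["2016-10-20", "2016-10-20", "2016-10-20", "2016-10-21"]
    ),
})

# Columns to count distinct values in (everything except the group keys).
value_cols = ["Dst", "Port", "Application", "Start"]

# Built-in nunique per (Src, Date) group, with no Python lambda per group.
df_day = df.groupby(["Src", "Date"])[value_cols].nunique().reset_index()
print(df_day)

# Alternative for a single column: drop duplicate (Src, Date, Dst) rows
# first, then count the rows that remain in each group.
dst_counts = (
    df[["Src", "Date", "Dst"]]
    .drop_duplicates()
    .groupby(["Src", "Date"])
    .size()
)
print(dst_counts)
```

Whether this is fast enough on the full data is untested here. If only one or two columns matter, the `drop_duplicates` variant may be worth timing against the `nunique` call.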