
I have the following data:

Out[6]: 

                 Src  Dst  Port  Application      Start        Date
0            0.0.0.0    1     1            1 2016-10-20  2016-10-20
1  00:00:0C:9F:F0:64   10     1            1 2016-10-20  2016-10-20
2  00:00:0C:9F:F0:65    3     1            1 2016-10-20  2016-10-20
3  00:00:0C:9F:F0:66   10     1            1 2016-10-20  2016-10-20
4  00:00:0C:9F:F0:67   42     1            1 2016-10-20  2016-10-20

In [7]: df.apply(lambda x: x.nunique())
Out[7]: 

Src            791215
Dst              2599
Port                1
Application        44
Start             335
Date               15
dtype: int64

I want to know, for each source and each day, the number of unique values in the other columns. I wrote:

df_day = df.groupby(['Src', 'Date'], as_index=False).apply(lambda x: x.apply(lambda x: x.nunique()))

but it is incredibly slow (it runs forever). The number of groups is quite large: 791215 * 15.

Is there any way I can speed up this computation?
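
For reference, here is a minimal runnable sketch of the setup. The frame is a tiny synthetic stand-in for the real ~10-million-row capture (values are made up), but the groupby/apply line is the slow one from the question:

    import pandas as pd

    # Tiny synthetic stand-in for the real flow data (hypothetical values).
    df = pd.DataFrame({
        'Src':         ['0.0.0.0', '0.0.0.0', '00:00:0C:9F:F0:64', '00:00:0C:9F:F0:64'],
        'Dst':         [1, 2, 10, 10],
        'Port':        [1, 1, 1, 1],
        'Application': [1, 3, 1, 2],
        'Start':       ['2016-10-20', '2016-10-20', '2016-10-20', '2016-10-20'],
        'Date':        ['2016-10-20', '2016-10-20', '2016-10-20', '2016-10-20'],
    })

    # The approach from the question: one Python-level apply per group,
    # and another apply per column inside each group.
    df_day = df.groupby(['Src', 'Date'], as_index=False).apply(
        lambda g: g.apply(lambda col: col.nunique())
    )
    print(df_day)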

Donbeo
  • Which pandas version are you using? There were some big performance improvements in the 0.19 release of pandas. – Quickbeam2k1 Dec 01 '16 at 10:40
  • `In [6]: pd.__version__ Out[6]: u'0.19.1' ` The data is quite big: `In [7]: df.shape Out[7]: (10076686, 6) ` – Donbeo Dec 01 '16 at 10:57
  • Why the nested apply? What is a unique *value*? The unique values for dst, port, application and start? – Quickbeam2k1 Dec 01 '16 at 11:20
  • Yes. For each pair (`Src, Date`) I want to know the number of unique values of the other columns. In other words, for each day I want to know how many different `Dst, Port, ...` each IP has connected to. – Donbeo Dec 01 '16 at 11:25
  • Okay, I just checked your code, and now I know why you use the double apply. I tried using only one apply with a for loop over the columns; on a small dataframe this is 15% slower. What may save time here is that you don't need to calculate the Src/Date unique values (which you currently do, even though you know the result beforehand); on my small dataframe that accounts for roughly 50% of the clock time (sketched after this thread). – Quickbeam2k1 Dec 01 '16 at 11:56
  • I'm afraid I don't see anything else here. Another idea could be to use Spark? It at least sounds like a nicely parallelizable task. You could also try to parallelize the computation yourself, which pandas unfortunately doesn't do for you. – Quickbeam2k1 Dec 01 '16 at 12:19
  • Thanks, I'll have a look at other libraries. Apparently there isn't any easy solution with pandas. – Donbeo Dec 01 '16 at 13:22
  • Check this [question](http://stackoverflow.com/q/40805769/2901002). – jezrael Dec 01 '16 at 14:05
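
Below is a sketch of the idea suggested in the comments (my own code, not from the thread): group once on the key columns and use the built-in `SeriesGroupBy.nunique` on each remaining column, so the key columns are not re-counted and the nested Python-level apply goes away. It assumes the `df` from the sketch earlier in the question.

    import pandas as pd

    # Columns whose distinct values we actually want per (Src, Date) group;
    # Src and Date are the group keys, so counting them is pointless.
    value_cols = ['Dst', 'Port', 'Application', 'Start']

    gb = df.groupby(['Src', 'Date'])

    # One groupby, then the built-in per-column nunique instead of a nested apply.
    df_day = pd.concat({col: gb[col].nunique() for col in value_cols}, axis=1)

    # Newer pandas releases also accept the shorter form:
    # df_day = gb[value_cols].nunique()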
