
I'm trying to analyze a network traffic dataset with more than 1,000,000 packets, and I have the following code:

import pandas as pd

pcap_data = pd.read_csv('/home/alexfrancow/AAA/data1.csv')
pcap_data.columns = ['no', 'time', 'ipsrc', 'ipdst', 'proto', 'len']
pcap_data['info'] = "null"
pcap_data.parse_dates=["time"]

pcap_data['num'] = 1
df = pcap_data
df    

%%time
df['time'] = pd.to_datetime(df['time'])
df.index = df['time']
data = df.copy()

data_group = pd.DataFrame({'count': data.groupby(['ipdst', 'proto', data.index]).size()}).reset_index()
pd.options.display.float_format = '{:,.0f}'.format
data_group.index = data_group['time']
data_group

data_group2 = data_group.groupby(['ipdst','proto']).resample('5S', on='time').sum().reset_index().dropna()
data_group2

The first part of the script, importing the .csv, runs in about 5 seconds, but the pandas groupby on IP + PROTO with the 5-second resample takes about 15 minutes. Does anyone know how I can get better performance?

EDIT:

Now I'm trying to use dask, and I have the following code:

Import the .csv

import dask.dataframe as dd

filename = '/home/alexfrancow/AAA/data1.csv'
df = dd.read_csv(filename)
df.columns = ['no', 'time', 'ipsrc', 'ipdst', 'proto', 'info']
df.parse_dates=["time"]
df['num'] = 1
%time df.head(2)


Group by ipdst + proto by 5S freq

df.set_index('time').groupby(['ipdst','proto']).resample('5S', on='time').sum().reset_index()

How can I group by IP + PROTO at a 5-second frequency in dask?

2 Answers


I tried to simplify your code a bit, but with a large DataFrame the performance will probably improve only slightly:

pd.options.display.float_format = '{:,.0f}'.format

#parse the time column as datetime while reading
pcap_data = pd.read_csv('/home/alexfrancow/AAA/data1.csv',
                        header=0,
                        names=['no', 'time', 'ipsrc', 'ipdst', 'proto', 'len'],
                        parse_dates=['time'])
pcap_data['info'] = "null"
pcap_data['num'] = 1

#remove DataFrame constructor
data_group = pcap_data.groupby(['ipdst', 'proto', 'time']).size().reset_index(name='count')

data_group2 = (data_group.set_index('time')
                         .groupby(['ipdst','proto'])
                         .resample('5S')
                         .sum()
                         .reset_index()
                         .dropna())
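
A further possibility, sketched here under the assumption that only the packet count per (ipdst, proto) pair and 5-second bin is needed: pd.Grouper can do the time binning inside a single groupby call, which avoids the per-group resample entirely.

import pandas as pd

# same file and column layout as in the question (assumed)
pcap_data = pd.read_csv('/home/alexfrancow/AAA/data1.csv',
                        header=0,
                        names=['no', 'time', 'ipsrc', 'ipdst', 'proto', 'len'],
                        parse_dates=['time'])

# one pass: count packets per (ipdst, proto) in 5-second bins
data_group2 = (pcap_data
               .groupby(['ipdst', 'proto', pd.Grouper(key='time', freq='5S')])
               .size()
               .reset_index(name='count'))

This also skips building the intermediate data_group frame.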
jezrael
  • Thanks for the help, but it was not useful. I'm thinking of cutting the .csv into smaller parts and analyzing the parts in parallel, but I don't know how to do it (see the chunked sketch after these comments). – alexfrancow Apr 26 '18 at 08:20
  • @AlejandroFranco - Not so easy, [dask](http://dask.pydata.org/en/latest/) should help, but I am not sure whether `groupby` + `resample` is implemented there. – jezrael Apr 26 '18 at 08:27
  • @AlejandroFranco - I think you need [this](https://stackoverflow.com/questions/33945086/dask-dataframe-resample-over-groupby-object-with-multiple-rows) – jezrael Apr 27 '18 at 10:26
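
A minimal sketch of the chunked idea from the first comment above. The chunk size is an illustrative assumption, and plain chunked reading in pandas is sequential rather than multithreaded; it mainly limits memory use.

import pandas as pd

chunksize = 100_000  # assumed chunk size, tune to the machine

partials = []
reader = pd.read_csv('/home/alexfrancow/AAA/data1.csv',
                     header=0,
                     names=['no', 'time', 'ipsrc', 'ipdst', 'proto', 'len'],
                     parse_dates=['time'],
                     chunksize=chunksize)

for chunk in reader:
    # aggregate each chunk separately: packets per (ipdst, proto, 5s bin)
    partials.append(chunk.groupby(['ipdst', 'proto',
                                   pd.Grouper(key='time', freq='5S')]).size())

# a 5-second bin that spans two chunks shows up in several partial results,
# so sum the partial counts per (ipdst, proto, time) key
data_group2 = (pd.concat(partials)
                 .groupby(level=[0, 1, 2]).sum()
                 .reset_index(name='count'))

Because the per-bin counts are additive, combining the partial results is just another groupby + sum.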

In dask:

meta = pd.DataFrame(columns=['no', 'ipsrc', 'info'], dtype=object,
                    index=pd.MultiIndex.from_arrays([[], [], []],
                                                    names=['ipdst', 'proto', 'time']))
df = df.set_index('time').groupby(['ipdst', 'proto']).apply(lambda x: x.resample('5S').sum(), meta=meta)
df = df.reset_index()

Hope it works for you.

Cherrymelon
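
For completeness, an end-to-end usage sketch of the dask route above. The path and column names follow the question, the num helper column and the meta frame are assumptions about the data, and .compute() materializes the result as a regular pandas DataFrame.

import dask.dataframe as dd
import pandas as pd

# path and column names follow the question (assumed layout)
df = dd.read_csv('/home/alexfrancow/AAA/data1.csv',
                 header=0,
                 names=['no', 'time', 'ipsrc', 'ipdst', 'proto', 'len'],
                 parse_dates=['time'])
df['num'] = 1  # helper column: one row per packet

# empty frame describing the per-group output, so dask knows the result's schema
meta = pd.DataFrame({'num': pd.Series(dtype='int64')},
                    index=pd.MultiIndex.from_arrays([[], [], []],
                                                    names=['ipdst', 'proto', 'time']))

result = (df.set_index('time')
            .groupby(['ipdst', 'proto'])
            .apply(lambda g: g[['num']].resample('5S').sum(), meta=meta)
            .compute()          # run the graph, returns a pandas DataFrame
            .reset_index())

print(result.head())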