I'm trying to analyze a network traffic dataset with over 1,000,000 packets, and I have the following code:
import pandas as pd

pcap_data = pd.read_csv('/home/alexfrancow/AAA/data1.csv')
pcap_data.columns = ['no', 'time', 'ipsrc', 'ipdst', 'proto', 'len']
pcap_data['info'] = "null"
# note: assigning pcap_data.parse_dates = ["time"] has no effect on a DataFrame,
# so the 'time' column is converted with pd.to_datetime further down
pcap_data['num'] = 1
df = pcap_data
df
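If the 'time' column is always in a parseable format, the datetime conversion could also be done while reading (just a sketch; header=0/names= reproduces the column renaming above and is an assumption about the CSV layout):

cols = ['no', 'time', 'ipsrc', 'ipdst', 'proto', 'len']
# sketch: parse the timestamp at read time instead of converting afterwards
pcap_data = pd.read_csv('/home/alexfrancow/AAA/data1.csv',
                        header=0, names=cols, parse_dates=['time'])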
%%time
# parse the timestamps and use them as the index
df['time'] = pd.to_datetime(df['time'])
df.index = df['time']

data = df.copy()
# count packets per (ipdst, proto, timestamp)
data_group = pd.DataFrame({'count': data.groupby(['ipdst', 'proto', data.index]).size()}).reset_index()
pd.options.display.float_format = '{:,.0f}'.format
data_group.index = data_group['time']
data_group

# group by destination IP and protocol, then resample into 5-second bins
data_group2 = data_group.groupby(['ipdst', 'proto']).resample('5S', on='time').sum().reset_index().dropna()
data_group2
The first part of the script, importing the .csv, runs in about 5 seconds, but the groupby on IP + PROTO followed by the 5-second resample takes about 15 minutes. Does anyone know how I can get better performance?
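A possibly faster alternative (just a sketch, assuming 'time' is already a datetime column as above and that a packet count per group is what is wanted) is to do the 5-second binning inside a single groupby with pd.Grouper instead of groupby + resample:

# sketch: one groupby pass with a 5-second time bin instead of groupby().resample()
data_group2 = (df.groupby(['ipdst', 'proto', pd.Grouper(key='time', freq='5S')])
                 .size()
                 .reset_index(name='count'))

This avoids running resample once per (ipdst, proto) group, which can be slow when there are many groups.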
EDIT:
Now I'm trying to use dask, and I have the following code:
Import the .csv
import dask.dataframe as dd

filename = '/home/alexfrancow/AAA/data1.csv'
df = dd.read_csv(filename)
df.columns = ['no', 'time', 'ipsrc', 'ipdst', 'proto', 'info']
# note: assigning df.parse_dates = ["time"] has no effect here either; 'time' is still a string column
df['num'] = 1
%time df.head(2)
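As in plain pandas, dd.read_csv forwards parse_dates (and header/names) to pandas.read_csv, so the timestamp could be parsed while reading (same layout assumption as in the pandas sketch above):

cols = ['no', 'time', 'ipsrc', 'ipdst', 'proto', 'info']
df = dd.read_csv('/home/alexfrancow/AAA/data1.csv',
                 header=0, names=cols, parse_dates=['time'])
df['num'] = 1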
Group by ipdst + proto at a 5-second frequency
df.set_index('time').groupby(['ipdst','proto']).resample('5S', on='time').sum().reset_index()
How can I group by IP + PROTO in 5-second bins with dask?
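As far as I know, dask does not support resample() after a groupby() the way pandas does, so one workaround (a sketch; it assumes 'time' has been parsed to datetime, and the 'bucket' column name is only illustrative) is to floor each timestamp to its 5-second bucket and then do an ordinary groupby:

df['time'] = dd.to_datetime(df['time'])    # only needed while 'time' is still a string
df['bucket'] = df['time'].dt.floor('5S')   # 5-second bucket as a regular column
counts = df.groupby(['ipdst', 'proto', 'bucket'])['num'].sum().reset_index()
result = counts.compute()                  # materialize the aggregated result as pandas

Because the aggregation collapses the data to one row per (ipdst, proto, bucket), compute() only materializes the summary, not the full million-row frame.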