I have a dictionary of pandas DataFrames called df. I want to split each DataFrame based on a gap threshold of 4.5 on the time_epoch column and then merge all the results into a single collection.
From this question and this question, I came up with the following code:
keys = df.keys()
all = Counter()
for key in keys:
    ids = (df[key]['time_epoch'] > (df[key]['time_epoch'].shift() + 4.5)).cumsum()
    gp = df[key].groupby(ids)
    all.update(Counter(dict(list(gp))))
I get the following error:
Traceback (most recent call last):
  File "C:\Users\...\Miniconda3\lib\site-packages\pandas\core\ops.py", line 1176, in na_op
    raise_on_error=True, **eval_kwargs)
  File "C:\Users\...\Miniconda3\lib\site-packages\pandas\core\computation\expressions.py", line 211, in evaluate
    **eval_kwargs)
  File "C:\Users\...\Miniconda3\lib\site-packages\pandas\core\computation\expressions.py", line 64, in _evaluate_standard
    return op(a, b)
TypeError: must be str, not int

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\...\Miniconda3\lib\site-packages\pandas\core\internals.py", line 1184, in eval
    result = get_result(other)
  File "C:\Users\...\Miniconda3\lib\site-packages\pandas\core\internals.py", line 1153, in get_result
    result = func(values, other)
  File "C:\Users\...\Miniconda3\lib\site-packages\pandas\core\ops.py", line 1202, in na_op
    result[mask] = op(xrav, y)
TypeError: must be str, not int

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:/code.py", line 53, in <module>
    function()
  File "D:/code.py", line 41, in function
    all.update(Counter(dict(list(flow_key))))
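The message "must be str, not int" in the traceback above is what Python reports when a string is added to a number, so one plausible cause (an assumption, since the column dtypes are not shown) is that time_epoch holds strings rather than floats. A minimal sketch on made-up data, checking the dtype and converting with pd.to_numeric before computing the gaps:

```python
import pandas as pd

# Made-up sample: timestamps stored as strings, as can happen when a
# CSV column is not parsed as numeric.
sample = pd.DataFrame({"time_epoch": ["1.0", "2.0", "10.0", "11.0"]})
print(sample["time_epoch"].dtype)  # inspect the dtype before doing arithmetic

# Convert to floats; errors="coerce" turns unparseable values into NaN.
sample["time_epoch"] = pd.to_numeric(sample["time_epoch"], errors="coerce")

# The same gap test as in the question now operates on numbers.
ids = (sample["time_epoch"] > sample["time_epoch"].shift() + 4.5).cumsum()
print(ids.tolist())
```

If the real column converts cleanly, the groupby on ids should no longer hit the string-plus-number addition; whether that is the actual cause here would need to be confirmed by inspecting df[key].dtypes.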
Edit1
My df is created as follows:
dftemp = pd.read_csv(
    "traffic.csv",
    skipinitialspace=True,
    usecols=[
        'time_epoch', 'ip.src', 'ip.dst', 'tcp.srcport', 'tcp.dstport',
        'frame.len', 'tcp.flags', 'Protocol',
    ],
    na_filter=False,
    encoding="utf-8")
complete = pd.read_csv(
    "traffic.csv",
    skipinitialspace=True,
    usecols=[
        'frame.time_epoch', 'ip.src', 'ip.dst', 'tcp.srcport',
        'tcp.dstport', 'frame.len', 'tcp.flags', 'Protocol',
    ],
    na_filter=False,
    encoding="utf-8")
complete.loc[(complete['ip.dst'] == hostip[i]), 'frame.len'] = complete['frame.len'] * -1
complete.loc[(complete['frame.len'] < 0), 'ip.src'] = dftemp['ip.dst']
complete.loc[(complete['frame.len'] < 0), 'ip.dst'] = dftemp['ip.src']
complete.loc[(complete['frame.len'] < 0), 'tcp.srcport'] = dftemp['tcp.dstport']
complete.loc[(complete['frame.len'] < 0), 'tcp.dstport'] = dftemp['tcp.srcport']
complete_flow = complete.groupby(
    ['ip.src','ip.dst','tcp.srcport','tcp.dstport','Protocol'])
df = dict(list(complete_flow))
df contains network traffic flows. I want to split each flow wherever the gap between consecutive packet timestamps exceeds a threshold.
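To make the intended split concrete, here is a minimal sketch on a single made-up flow. It assumes the rows are already sorted by time_epoch and that the column is numeric; the sample values and the names GAP, flow and sub_flows are illustrative only. It uses diff(), which for numeric data expresses the same test as comparing against shift() + 4.5:

```python
import pandas as pd

GAP = 4.5  # threshold from the question, in the units of time_epoch

# Made-up packets for a single flow, sorted by timestamp.
flow = pd.DataFrame({
    "time_epoch": [0.0, 1.0, 2.5, 9.0, 9.5, 20.0],
    "frame.len": [60, 1500, 40, 60, 52, 60],
})

# Gap between each packet and the previous one (NaN for the first row).
gaps = flow["time_epoch"].diff()

# A new sub-flow starts wherever the gap exceeds GAP; the running sum
# of those boolean flags gives each row its sub-flow number.
sub_flow_id = (gaps > GAP).cumsum()

sub_flows = [part for _, part in flow.groupby(sub_flow_id)]
for part in sub_flows:
    print(part["time_epoch"].tolist())
```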
Edit2
I found that Counter only keeps a count for each key, so instead I iterate over each new dictionary and create a unique key for every entry. Is there a more Pythonic way of doing this?
flows = {}
for key in keys:
    # Start a new group wherever the gap to the previous packet exceeds 4.5
    flow_ids = (df[key]['time_epoch'] > (df[key]['time_epoch'].shift() + 4.5)).cumsum()
    gp = df[key].groupby(flow_ids)
    df2 = dict(list(gp))
    # Build a unique key from the flow key and the group number
    for sub_id in df2:
        flows["%s, %s" % (key, sub_id)] = df2[sub_id]
    del df2
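One possibly more compact form of the loop above is a dictionary comprehension keyed by (flow key, sub-flow number) tuples instead of formatted strings. This is only a sketch: split_flows and the sample frames are hypothetical names and data, and it assumes each DataFrame is sorted by a numeric time_epoch column.

```python
import pandas as pd


def split_flows(frames, gap=4.5, column="time_epoch"):
    """Split every DataFrame in `frames` on time gaps larger than `gap`.

    Returns a dict keyed by (original key, sub-flow number).
    """
    return {
        (key, sub_id): part
        for key, frame in frames.items()
        # diff() > gap flags the first row of each new sub-flow;
        # cumsum() turns those flags into group numbers.
        for sub_id, part in frame.groupby((frame[column].diff() > gap).cumsum())
    }


# Made-up input standing in for the df dictionary of flows.
frames = {
    ("10.0.0.1", "10.0.0.2"): pd.DataFrame({"time_epoch": [0.0, 1.0, 10.0]}),
    ("10.0.0.3", "10.0.0.4"): pd.DataFrame({"time_epoch": [5.0, 6.0]}),
}

flows = split_flows(frames)
print(sorted(flows))
```

Tuple keys keep the original flow key intact, so the pieces of a flow can later be looked up or regrouped without parsing a string.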