
I have a dictionary of pandas DataFrames called df. I want to split each DataFrame wherever the gap between consecutive values in the time_epoch column exceeds a threshold of 4.5, and then merge all the results into a single collection.

Based on this question and this question, I came up with the following code, but I get an error:

keys= df.keys()    
all = Counter()
for key in keys:
    ids = (df[key]['time_epoch'] > (df[key]['time_epoch'].shift() + 4.5)).cumsum()
    gp= df[key].groupby(ids)
    all.update(Counter(dict(list(gp))))

I get the following error:

Traceback (most recent call last):
File "C:\Users\...\Miniconda3\lib\site-packages\pandas\core\ops.py", line 1176, in na_op
    raise_on_error=True, **eval_kwargs)
File "C:\Users\...\Miniconda3\lib\site-packages\pandas\core\computation\expressions.py", line 211, in evaluate
    **eval_kwargs)
File "C:\Users\...\Miniconda3\lib\site-packages\pandas\core\computation\expressions.py", line 64, in _evaluate_standard
    return op(a, b)
TypeError: must be str, not int

During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File  "C:\Users\...\Miniconda3\lib\site-packages\pandas\core\internals.py",  line 1184, in eval
     result = get_result(other)
File "C:\Users\...\Miniconda3\lib\site-packages\pandas\core\internals.py",  line 1153, in get_result
     result = func(values, other)
File "C:\Users\...\Miniconda3\lib\site-packages\pandas\core\ops.py", line  1202, in na_op
     result[mask] = op(xrav, y)
TypeError: must be str, not int

During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "D:/code.py", line 53, in  <module>
     function()
File "D:/code.py", line 41, in function
     all.update(Counter(dict(list(flow_key))))
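
One plausible reading of this traceback, offered as a hedged sketch rather than a confirmed diagnosis: `Counter.update` adds incoming values to the stored ones instead of overwriting them, so once the Counter is non-empty it evaluates `DataFrame + 0`, which fails on string columns such as `ip.src`. The `make_flow` helper and the toy frames below are hypothetical stand-ins for the real flows.

```python
from collections import Counter

import pandas as pd


def make_flow(ip):
    # Hypothetical stand-in for one flow: a string column plus a numeric one.
    return pd.DataFrame({
        "ip.src": pd.Series([ip], dtype=object),
        "time_epoch": [0.0],
    })


merged = Counter()
# In CPython, updating an empty Counter from a mapping just stores the values.
merged.update(Counter({"flow_a": make_flow("10.0.0.1")}))

try:
    # The Counter is now non-empty, so update() computes new_value + 0,
    # i.e. DataFrame + int, which cannot be applied to the string column.
    merged.update(Counter({"flow_b": make_flow("10.0.0.2")}))
    error = None
except TypeError as exc:
    error = exc

# A plain dict overwrites instead of adding, so it can hold DataFrames.
plain = {}
plain.update({"flow_a": make_flow("10.0.0.1")})
plain.update({"flow_b": make_flow("10.0.0.2")})
```

If this reading is right, swapping `Counter` for a plain `dict` (with keys that stay unique across flows) would avoid the arithmetic entirely.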

Edit1

My df is created as follows:

dftemp = pd.read_csv(
    "traffic.csv",
    skipinitialspace=True,
    usecols=[
        'time_epoch', 'ip.src', 'ip.dst', 'tcp.srcport', 'tcp.dstport',
        'frame.len', 'tcp.flags', 'Protocol',
    ],
    na_filter=False,
    encoding="utf-8")
complete = pd.read_csv(
    "traffic.csv",
    skipinitialspace=True,
    usecols=[
        'frame.time_epoch', 'ip.src', 'ip.dst', 'tcp.srcport',
        'tcp.dstport', 'frame.len', 'tcp.flags', 'Protocol',
    ],
    na_filter=False,
    encoding="utf-8")

complete.loc[(complete['ip.dst'] == hostip[i]), 'frame.len'] = complete['frame.len'] * -1
complete.loc[(complete['frame.len'] < 0), 'ip.src'] = dftemp['ip.dst']
complete.loc[(complete['frame.len'] < 0), 'ip.dst'] = dftemp['ip.src']
complete.loc[(complete['frame.len'] < 0), 'tcp.srcport'] = dftemp['tcp.dstport']
complete.loc[(complete['frame.len'] < 0), 'tcp.dstport'] = dftemp['tcp.srcport']

complete_flow = complete.groupby(
    ['ip.src','ip.dst','tcp.srcport','tcp.dstport','Protocol'])
df = dict(list(complete_flow))

df contains network traffic flows, and I want to split each flow using a threshold on the gap between packet timestamps.
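
To make the intended split concrete, here is a minimal sketch on a single made-up flow. It assumes `time_epoch` is numeric and already sorted, and it uses `diff()`, which is equivalent to comparing against `shift() + 4.5`. The sample timestamps and the names `GAP`, `flow`, and `segment_id` are hypothetical.

```python
import pandas as pd

GAP = 4.5  # threshold from the question, in the same units as time_epoch

# Hypothetical timestamps for a single flow, already sorted by time.
flow = pd.DataFrame({"time_epoch": [0.0, 1.0, 2.0, 10.0, 11.0, 20.0]})

# diff() gives the gap to the previous packet. The first row is NaN,
# which compares as False, so it stays in segment 0.
segment_id = (flow["time_epoch"].diff() > GAP).cumsum()

# One sub-DataFrame per segment, in order of appearance.
segments = [part for _, part in flow.groupby(segment_id)]
```

Each time a gap larger than the threshold appears, the cumulative sum increases by one, so consecutive packets share a segment number until the next large gap.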

Edit2

I found that Counter only keeps a count for each key, so I iterate over the new dictionary and create a unique key for each entry. Is there a more Pythonic way of doing this?

flows = {}
i = 1
for key in keys:
    i += 1
    ids = (df[key]['time_epoch'] > (df[key]['time_epoch'].shift() + 4.5)).cumsum()
    gp = df[key].groupby(ids)
    df2 = dict(list(gp))
    keys2 = df2.keys()
    for i in keys2:
        flows["%s, %s" % (key,i)] = df2[i]
    del df2
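
One possible more compact form of the loop above, as a sketch under assumptions: it iterates the groupby directly instead of building an intermediate dict, and it uses `(flow key, segment number)` tuples as keys instead of formatted strings. The helper `split_on_gaps` and the sample `df` are hypothetical, not part of the original code.

```python
import pandas as pd

GAP = 4.5  # threshold from the question


def split_on_gaps(frame, gap=GAP):
    """Return a groupby that iterates as (segment_number, sub_frame) pairs."""
    ids = (frame["time_epoch"].diff() > gap).cumsum()
    return frame.groupby(ids)


# Hypothetical stand-in for the question's `df` dict of flows.
df = {
    ("10.0.0.1", "10.0.0.2"): pd.DataFrame({"time_epoch": [0.0, 1.0, 9.0]}),
    ("10.0.0.3", "10.0.0.4"): pd.DataFrame({"time_epoch": [5.0, 6.0]}),
}

# A nested dict comprehension: outer loop over flows, inner loop over segments.
flows = {
    (key, segment): part
    for key, frame in df.items()
    for segment, part in split_on_gaps(frame)
}
```

Tuple keys keep the original flow key recoverable without parsing a string, although formatted string keys as in the loop above would work just as well.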
  • Could you please help me improve the question instead of voting to close, or at least explain your reason? – user3806649 Nov 29 '17 at 23:52
  • Can you please post a sample of your dataframe – Alekhya Vemavarapu Nov 30 '17 at 00:09
  • @AlekhyaVemavarapu, I added the explanation of the creation of the dataframe. – user3806649 Nov 30 '17 at 08:35
  • @user3806649 In your **Edit2**, you have 2 occurrences of `i` (`i += 1` and `for i in ...`): this is definitely a mistake somewhere in your code (in particular, why do you have `i` in the first place since you don't use it). Moreover, using `dict( list( ...))` and then just using the keys of the dictionary seems strange. – Jean-Francois T. Apr 27 '18 at 06:22
  • @Jean-FrancoisT., the first use of `i` is superfluous, as you mentioned. About the (key, value) pair: since I want to merge multiple dictionaries, the merged dict (`flows`) holds the same values as `df2`: `flows["%s, %s" % (key,i)] = df2[i]` – user3806649 Apr 28 '18 at 06:51
  • @user3806649 Maybe you could have done something like `flows.update({ "%s, %s" % (key,k):v for k, v in dict(list(gp)).items() })`. – Jean-Francois T. May 02 '18 at 03:09

0 Answers