
Scenario

I have a dataframe with a given structure, and to sum it up, at the end I want to find the time difference between the responses and requests of a service. It has the following columns:

  • Timestamp
  • Service
  • Command
  • Message_type
  • Message_ID

And an example of data would be:

Timestamp   Service      Command     Message_Type   Message_ID
12:00:00    FoodOrders  SeeStock()  Request        125
12:00:02    FoodOrders  SeeStock()  Response       125

The output would have to be something like:

Service     Command   Message_ID  TimeDiff
FoodOrders  SeeStock  125         00:00:02

What I have thought of doing

Grouping by Service, Command, Message_ID and adding an additional column with some function that calculates the time difference.
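
For illustration, a minimal sketch of that idea (the sample data below is an assumption based on the table above, not my real frame):

import pandas as pd

# Assumed sample data mirroring the example table
df = pd.DataFrame({
    'Timestamp': pd.to_timedelta(['12:00:00', '12:00:02']),
    'Service': ['FoodOrders', 'FoodOrders'],
    'Command': ['SeeStock()', 'SeeStock()'],
    'Message_Type': ['Request', 'Response'],
    'Message_ID': [125, 125],
})

# Group by the identifying columns and take last minus first timestamp
out = (df.sort_values('Timestamp')
         .groupby(['Service', 'Command', 'Message_ID'])['Timestamp']
         .agg(lambda s: s.iloc[-1] - s.iloc[0])
         .rename('TimeDiff')
         .reset_index())
print(out)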

My actual questions

  • Is my initial plan good? I am looking to make the code as clean and fast as possible.

Thanks.

Alejandro A
  • When I looked at the problem, I thought of the same solution. So probably that might be a good idea to go ahead with. – Kenstars Jun 06 '19 at 08:32

4 Answers


The plan is more or less OK. Note that for efficiency it is better not to pass a lambda function directly to calculate a custom aggregation like TimeDiff. It is better to first compute auxiliary aggregations that can be done with pandas / numpy built-ins and then derive your custom aggregation from those.
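
A sketch of what that could look like (the sample frame and the choice of min/max aggregations are assumptions, not the asker's actual code):

import pandas as pd

# Assumed sample data; Timestamp already parsed as a timedelta
df = pd.DataFrame({
    'Timestamp': pd.to_timedelta(['12:00:00', '12:00:02']),
    'Service': ['FoodOrders', 'FoodOrders'],
    'Command': ['SeeStock()', 'SeeStock()'],
    'Message_Type': ['Request', 'Response'],
    'Message_ID': [125, 125],
})

# Built-in aggregations first (fast, vectorised) ...
agg = df.groupby(['Service', 'Command', 'Message_ID'])['Timestamp'].agg(['min', 'max'])

# ... then the custom TimeDiff column computed from them
agg['TimeDiff'] = agg['max'] - agg['min']
print(agg.reset_index()[['Service', 'Command', 'Message_ID', 'TimeDiff']])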

GZ0

You can measure the time on your own with this code from another post:

import time

start = time.time()
print("hello")
end = time.time()
print(end - start)

Try your approach and the lambda version to compare them.

PV8

If you use a Jupyter notebook, you can try something like this:

%timeit df.sort_values('Timestamp').groupby(['Service', 'Command', 'Message_ID']).apply(lambda x: x.iloc[1]['Timestamp'] - x.iloc[0]['Timestamp'])

On my sample data I get this output:

2.97 ms ± 310 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

And I also think that it's a good plan = )


If performance is important, avoid aggregation with groupby because it is slow. It is better to create Request and Response Series with a MultiIndex and subtract the Timestamps; sort_index should also help with performance:

#if necessary
#df['Timestamp'] = pd.to_timedelta(df['Timestamp'])

cols = ['Service','Command','Message_ID']
s1 = df[df['Message_Type'] == 'Response'].set_index(cols)['Timestamp'].sort_index()
s2 = df[df['Message_Type'] == 'Request'].set_index(cols)['Timestamp'].sort_index()

df1 = s1.sub(s2).reset_index()
print (df1)
      Service     Command  Message_ID Timestamp
0  FoodOrders  SeeStock()         125  00:00:02
jezrael
  • Hey, I went for this approach finally, and it seems to be working like a charm. Thanks. My question now is, if you could, explain the lines a little bit. What does setting the index based on those cols really mean? And does ordering by the index then mean that, when doing the subtraction, it will hopefully be in the same order? Would this be robust if, for example, for some message I just had a Response and not a Request? – Alejandro A Jun 06 '19 at 11:04
  • @AlejandroA - yes, sure. The MultiIndex is necessary for one important reason: if there are multiple rows of data, correct matching between the `'Service','Command','Message_ID'` values is needed, so only values with the same MultiIndex triple `'Service','Command','Message_ID'` are subtracted from the matching triples in the other Series. It is called alignment. A small sketch of this is below. – jezrael Jun 06 '19 at 11:07
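
To illustrate the alignment point with the missing-Request case (a small sketch with made-up data): a Message_ID that only has a Response ends up as NaT after the subtraction.

import pandas as pd

# Assumed data: message 126 has a Response but no Request
df = pd.DataFrame({
    'Timestamp': pd.to_timedelta(['12:00:00', '12:00:02', '12:00:05']),
    'Service': ['FoodOrders'] * 3,
    'Command': ['SeeStock()'] * 3,
    'Message_Type': ['Request', 'Response', 'Response'],
    'Message_ID': [125, 125, 126],
})

cols = ['Service', 'Command', 'Message_ID']
s1 = df[df['Message_Type'] == 'Response'].set_index(cols)['Timestamp'].sort_index()
s2 = df[df['Message_Type'] == 'Request'].set_index(cols)['Timestamp'].sort_index()

# Subtraction aligns both Series on the MultiIndex;
# Message_ID 126 has no matching Request, so its TimeDiff is NaT
print(s1.sub(s2).reset_index())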