
I have large csv files of traffic data similar to the sample below, for which I need to calculate the total bytes and the duration of each data transfer. The time ranges overlap, so they must be merged before the durations are summed:

first_packet_ts last_packet_ts  bytes_uplink bytes_downlink service    user_id
1441901695012   1441901696009       165             1212    facebook    3
1441901695500   1441901696212        23             4321    facebook    3
1441901698000   1441901698010       242             3423    youtube     4
1441901698400   1441901698500       423             2344    youtube     4

Desired output:

 duration     bytes_uplink      bytes_downlink    service          user_id
   1200             188             5533          facebook            3
   110              665             5767          youtube             4   

I currently use something like the following lines:

df = pd.read_csv(input_file_path)
df = df.groupby(['service', 'user_id'])
durations = df.apply(calculate_duration) 
df = df[['bytes_uplink', 'bytes_downlink']].sum()
df = df.reset_index()

The calculate_duration function (below) iterates the contents of each group, merges the overlapping time intervals and then returns a dataframe which is then concatenated to the summed dataframe df.

def calculate_duration(group):
    # the merge logic assumes the rows are ordered by start time
    ranges = group.sort_values('first_packet_ts')[['first_packet_ts', 'last_packet_ts']].itertuples()
    duration = 0
    for i, current_start, current_stop in ranges:
        # the inner loop consumes the same iterator, so each row is visited once
        for i, start, stop in ranges:
            if start > current_stop:
                duration += current_stop - current_start
                current_start, current_stop = start, stop
            else:
                current_stop = max(current_stop, stop)
        duration += current_stop - current_start
    return duration
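For reference, here is the approach above run end to end on the sample data as a self-contained snippet (the function is restated with an explicit sort on `first_packet_ts`, which the merge logic requires; the sample frame is rebuilt from the table in the question):

```python
import pandas as pd

def calculate_duration(group):
    # merge overlapping [first, last] intervals; rows must be ordered by start time
    ranges = group.sort_values('first_packet_ts')[['first_packet_ts', 'last_packet_ts']].itertuples()
    duration = 0
    for i, current_start, current_stop in ranges:
        # the inner loop consumes the same iterator, so each row is visited once
        for i, start, stop in ranges:
            if start > current_stop:
                duration += current_stop - current_start
                current_start, current_stop = start, stop
            else:
                current_stop = max(current_stop, stop)
        duration += current_stop - current_start
    return duration

# sample data from the question
df = pd.DataFrame({
    'first_packet_ts': [1441901695012, 1441901695500, 1441901698000, 1441901698400],
    'last_packet_ts':  [1441901696009, 1441901696212, 1441901698010, 1441901698500],
    'bytes_uplink':    [165, 23, 242, 423],
    'bytes_downlink':  [1212, 4321, 3423, 2344],
    'service': ['facebook', 'facebook', 'youtube', 'youtube'],
    'user_id': [3, 3, 4, 4],
})

grouped = df.groupby(['service', 'user_id'])
durations = grouped.apply(calculate_duration).rename('duration')
sums = grouped[['bytes_uplink', 'bytes_downlink']].sum()
result = pd.concat([durations, sums], axis=1).reset_index()
```

This reproduces the desired output (1200 ms for facebook, 110 ms for youtube), but the per-group `apply` is exactly the slow part the question asks to avoid.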

This approach is very slow as it involves iteration and invoking the apply method for each group.

Is there a more efficient way to calculate the duration of the data transfer, merging the overlapping intervals, using pandas (avoiding iteration somehow?), preferably without resorting to Cython?

George Vakras
  • Can you show `calculate_duration`? – chrisb Sep 11 '15 at 10:39
  • I added a version of my calculate_duration function. However I think that the goal is not to optimize the function, but if possible, do the calculation without using apply(). (Even if the function is empty, the performance hit is significant) – George Vakras Sep 11 '15 at 12:38

2 Answers


How about this? (Having timed it, it might be a bit slower...)

pd.pivot_table(df, columns='user_id', index='service',
               values=['bytes_uplink', 'bytes_downlink'], aggfunc=sum)

Edit: I don't think this is any more valid than yours but you could try something along these lines:

import numpy as np
import pandas as pd

# create dummy start/end dataframe
df = pd.DataFrame({'end': pd.Series([50, 100, 120, 150]), 'start': pd.Series([30, 0, 40, 130])})
df = df[['start', 'end']]
df = df.sort_values('start')

# rolling max of the end times, shifted so each row sees only earlier rows
df['roll_end'] = df.end.cummax()
df.roll_end = df.roll_end.shift()

df['new_start'] = df.start
overlap = df.start - df.roll_end < 0
# if start is before the rolling max end time then reset start to the rolling max end time
df.loc[overlap, 'new_start'] = df.roll_end[overlap]

# if the new start is after end, the interval is completely overlapped, so drop it
print(np.sum([x for x in df.end - df.new_start if x > 0]))
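The same cummax idea can be extended to the grouped problem without `apply`. A sketch, using the column names and sample data from the question (untimed, so treat it as a starting point rather than a guaranteed speedup):

```python
import pandas as pd

# sample data from the question
df = pd.DataFrame({
    'first_packet_ts': [1441901695012, 1441901695500, 1441901698000, 1441901698400],
    'last_packet_ts':  [1441901696009, 1441901696212, 1441901698010, 1441901698500],
    'service': ['facebook', 'facebook', 'youtube', 'youtube'],
    'user_id': [3, 3, 4, 4],
})

# sorting within each group by start time is required for the cummax trick
df = df.sort_values(['service', 'user_id', 'first_packet_ts'])
keys = [df['service'], df['user_id']]

# rolling max end time per group, shifted so each row only sees earlier rows
prev_end = df['last_packet_ts'].groupby(keys).cummax().groupby(keys).shift()

# push each start forward to the previous max end when the intervals overlap;
# NaN marks the first row of a group, which keeps its own start
new_start = df['first_packet_ts'].where(
    prev_end.isna() | (df['first_packet_ts'] > prev_end), prev_end)

# clip at zero so fully contained intervals contribute nothing
durations = (df['last_packet_ts'] - new_start).clip(lower=0).groupby(keys).sum()
```

`durations` is then indexed by `(service, user_id)` and can be concatenated with the byte sums from the existing groupby.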
matt_s
  • Thank you, however the question is about how to calculate the duration of the overlapping intervals that are grouped together. – George Vakras Sep 13 '15 at 12:24
  • @GeorgeV. Ok, sorry I was just trying to reproduce your result. Could you just use the difference of the max/min last/first within each group? Seems more meaningful to me but I obviously don't know what your end goal is. – matt_s Sep 14 '15 at 11:22
  • Well the ultimate goal would be to calculate the throughput for each service. Using the difference of the max/min last/first of each group is not meaningful in this context as it only takes into account the first and the last packet seen. I hoped that pandas or numpy would have such a feature that would easily allow to do such calculations without iterations. – George Vakras Sep 14 '15 at 17:10
  • @GeorgeV. thought about this slightly more and realised to use cummax() rather than rolling max. Have updated answer. – matt_s Sep 16 '15 at 10:00

The code below reproduces your output given the sample data. Is that what you're looking for?

>>> df.groupby(['service', 'user_id'])[['bytes_uplink', 'bytes_downlink']].sum().reset_index()
    service  user_id  bytes_uplink  bytes_downlink
0  facebook        3           188            5533
1   youtube        4           665            5767
Alexander
  • No, the output must contain the duration of the time interval. The problem is that the time intervals may overlap, so I cannot simply sum the duration for each row. – George Vakras Sep 13 '15 at 12:22
  • Perhaps you should update your sample data. Running your code on the sample data produces the same result. – Alexander Sep 13 '15 at 12:24
  • Yes, I do not show that I merge the result of calculate_duration with the result of the aggregation in the code above. I will update the question. – George Vakras Sep 13 '15 at 12:30