I have large csv files of traffic data similar to the sample below, for which I need to calculate the total bytes and the duration of each data transfer. The time ranges are overlapping, but they must be merged:
first_packet_ts last_packet_ts bytes_uplink bytes_downlink service user_id
1441901695012 1441901696009 165 1212 facebook 3
1441901695500 1441901696212 23 4321 facebook 3
1441901698000 1441901698010 242 3423 youtube 4
1441901698400 1441901698500 423 2344 youtube 4
Desired output:
duration bytes_uplink bytes_downlink service user_id
1200 188 5533 facebook 3
110 665 5767 youtube 4
I currently use something like the following lines:
df = pd.read_csv(input_file_path)
df = df.groupby(['service', 'user_id'])
durations = df.apply(calculate_duration)
df = df[['bytes_uplink', 'bytes_downlink']].sum()
df = df.reset_index()
The calculate_duration function (below) iterates the contents of each group, merges the overlapping time intervals and then returns a dataframe which is then concatenated to the summed dataframe df.
def calculate_duration(group):
ranges = group[['first_packet_ts', 'last_packet_ts']].itertuples()
duration = 0
for i,current_start, current_stop in ranges:
for i, start, stop in ranges:
if start > current_stop:
duration += current_stop - current_start
current_start, current_stop = start, stop
else:
current_stop = max(current_stop, stop)
duration += current_stop - current_start
return duration
This approach is very slow as it involves iteration and invoking the apply method for each group.
Is there a more efficient way to calculate the duration of the data transfer, merging the overlapping intervals, using pandas (avoid iteration somehow?) preferably without resorting to cython?