
I'm trying to group messages that are sent out into 1-second time intervals. I'm currently calculating time latency with this:

def time_deltas(infile):
    entries = (line.split() for line in open(infile, "r"))
    ts = {}
    for e in entries:
        if " ".join(e[2:5]) == "T out: [O]":
            ts[e[8]] = e[0]
        elif " ".join(e[2:5]) == "T in: [A]":
            in_ts, ref_id = e[0], e[7]
            out_ts = ts.pop(ref_id, None)
            yield (float(out_ts), ref_id[1:-1], (float(in_ts)*1000 - float(out_ts)*1000))

INFILE = 'C:/Users/klee/Documents/test.txt'
import csv 

with open('test.csv', 'w') as f: 
    csv.writer(f).writerows(time_deltas(INFILE))

HOWEVER, I want to calculate the number of "T in: [A]" messages that are sent out per second, and have been trying to work with this to do so:

import datetime
import bisect
import collections

data = [ (datetime.datetime(2010, 2, 26, 12, 8, 17), 5594813),
         (datetime.datetime(2010, 2, 26, 12, 7, 31), 5594810),
         (datetime.datetime(2010, 2, 26, 12, 6, 4),  5594807),
]
interval = datetime.timedelta(seconds=50)
start = datetime.datetime(2010, 2, 26, 12, 6, 4)
grid = [start + n*interval for n in range(10)]
bins = collections.defaultdict(list)
for date, num in data:
    idx = bisect.bisect(grid, date)
    bins[idx].append(num)
for idx, nums in bins.items():
    print('{0} --- {1}'.format(grid[idx], len(nums)))

which can be found here: Python: group results by time intervals

(I realize the units would be off for what I want, but I'm just looking into the general idea...)

I've been mostly unsuccessful thus far and would appreciate any help.

Also, the data appears as:

082438.577652 - T in: [A] accepted. ordID [F25Q6] timestamp [082438.575880] RefNumber [6018786] State [L]
eunhealee

2 Answers


Assuming you want to group your data into 1-second intervals on the second, we can make use of the fact that your data is ordered and that int(out_ts) truncates the timestamp to the second, which we can use as a grouping key.

The simplest way to do the grouping would be to use itertools.groupby:

from itertools import groupby

data = time_deltas(INFILE)
get_key = lambda x: int(x[0])  # function to get group key from data
bins = [(k, list(g)) for k, g in groupby(data, get_key)]

bins will be a list of tuples where the first value in the tuple is the key (an integer, e.g. 82438) and the second value is a list of data entries that were issued on that second (with timestamp = 082438.*).

Example usage:

# print out the number of messages for each second
for sec, data in bins:
    print('{0} --- {1}'.format(sec, len(data)))

# write (sec, msg_per_sec) out to CSV file
import csv
with open("test.csv", "w") as f:
    csv.writer(f).writerows((s, len(d)) for s, d in bins)

# get average message per second
message_counts = [len(d) for s, d in bins]
avg_msg_per_second = float(sum(message_counts)) / len(message_counts)

P.S. In this example, a list was used for bins so that the order of data is maintained. If you need random access to the data, consider using an OrderedDict instead.
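
A minimal sketch of that OrderedDict variant, assuming the same time_deltas generator and get_key function as above (the lookup value is only illustrative):

from collections import OrderedDict
from itertools import groupby

# Same grouping as before, but stored as an ordered mapping: second -> entries.
bins = OrderedDict((k, list(g)) for k, g in groupby(time_deltas(INFILE), get_key))

# Iteration order is preserved, and individual seconds can be looked up directly.
print(len(bins.get(82438, [])))  # number of messages seen during second 082438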


Note that it is relatively straightforward to adapt the solution to group by multiples of seconds. For example, to group messages per minute (60 seconds), change the get_key function to:

get_key = lambda x: int(x[0] / 60)  # truncate timestamp to the minute
Shawn Chin

This is easier if you don't build a grid of time intervals and bisect into it.

Instead, do this. Transform each timestamp into a single interval number.

def map_time_to_interval_number(epoch, times):
    for t in times:
        delta = t - epoch
        delta_t = delta.days*60*60*24 + delta.seconds + delta.microseconds/1000000.0
        interval = int(delta_t / 50)
        yield interval, t

from collections import defaultdict

times = [d for d, num in data]   # just the datetime part of each (datetime, num) pair
counts = defaultdict(int)
epoch = min(times)
for interval, t in map_time_to_interval_number(epoch, times):
    counts[interval] += 1

The interval will be an integer. 0 is the first 50-second interval. 1 is the second 50-second interval. etc.

You can reconstruct the timestamp from the interval number, knowing that each interval is 50 seconds wide and begins at epoch.
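
A rough sketch of that reconstruction, assuming the counts and epoch built above (variable names are only illustrative):

import datetime

# Recover the start time of each 50-second bucket and print its message count.
for interval, count in sorted(counts.items()):
    bucket_start = epoch + datetime.timedelta(seconds=50 * interval)
    print('{0} --- {1}'.format(bucket_start, count))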

S.Lott
  • I'm having difficulty with this: NameError: name 'defaultdict' is not defined. I apologize if I'm just not familiar with this. – eunhealee Jan 11 '12 at 21:39
  • You're having difficulty because Google is broken. Here's the first hit on a Google search for "python defaultdict". http://docs.python.org/library/collections.html It's important that you read and understand this library. – S.Lott Jan 11 '12 at 21:40