
I have a largish list (150,000 entries) of events, each time-stamped with when it started and when it finished. I'm trying to identify the time at which the maximum number of events were concurrent. My preferred language is Python, and my example is below.

The dictionary EVENTS holds the data: tag is an id for the event, and Start and End are datetime objects holding the respective start and end times for each instance, so:

EVENTS[tag][end][start] = [list of occurrences at that start/end time stamp]
endkey = EVENTS[tag].keys()
endkey.sort()
peak = 0
for end in endkey:
    count = 0
    endentrykey = EVENTS[tag].keys()
    endentrykey.sort()
    for endtime in endentrykey:
        if endtime < end:   # we can disregard entries that ended before the event
            continue
        startentrykey = EVENTS[tag][endtime].keys()
        startentrykey.sort()
        for starttime in startentrykey:
            if starttime > end:  # ignore events that started after the event ended
                break
            count += len(EVENTS[tag][endtime][starttime])
    peak = max(peak, count)
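For reference, here is a minimal, illustrative sketch of how that structure gets populated (the tag and occurrence values here are made up):

```python
from collections import defaultdict
from datetime import datetime

# EVENTS[tag][end][start] -> list of occurrences sharing that start/end pair
EVENTS = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))

tag = "example-tag"
start = datetime(2019, 1, 1, 9, 0)
end = datetime(2019, 1, 1, 9, 30)
EVENTS[tag][end][start].append("occurrence-1")
EVENTS[tag][end][start].append("occurrence-2")
```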

I haven't tried multi-threading this, but my strong suspicion is that I'm CPU bound.

Can someone suggest a better algorithm to accomplish this?

keineahnung2345
user3073001

1 Answer


You could look at answers to this question on Python profiling to help you identify bottlenecks. There's a fair amount of repeated sorting going on, which may be sub-optimal.

If your event data is sorted by start time, you should be able to iterate through the events in order and look behind and ahead of each one to count the overlapping events.

Here's an example of what I mean: I generate 150,000 sequentially starting events with random durations, iterate through them in the manner described above, and print the ten events with the most concurrences.

>python py_counter.py
(7548: start 1546268748.0, duration: 89.0, 47)
(7547: start 1546268747.0, duration: 8.0, 46)
(7546: start 1546268746.0, duration: 86.0, 45)
(7545: start 1546268745.0, duration: 36.0, 44)
(7544: start 1546268744.0, duration: 21.0, 43)
(59089: start 1546320289.0, duration: 25.0, 43)
(7543: start 1546268743.0, duration: 75.0, 42)
(59088: start 1546320288.0, duration: 96.0, 42)
(104116: start 1546365316.0, duration: 70.0, 42)
(7542: start 1546268742.0, duration: 94.0, 41)
Finished 150,000 events in 5.1 seconds

Here's the code:

from datetime import datetime, timedelta
from dataclasses import dataclass
from collections import Counter
from random import randint
import time


@dataclass(eq=True, frozen=True)  # make hashable
class Event:
    """Track event durations"""
    tag: int
    start: datetime
    end: datetime

    def __repr__(self):
        return f"{self.tag}: start {self.start.timestamp()}, duration: {self.end.timestamp() - self.start.timestamp()}"

start_time = time.time()
# create some sequentially starting events with random durations
events = []
for i in range(150000):
    start = datetime(2019, 1, 1) + timedelta(seconds=i)
    end = start + timedelta(seconds=randint(1, 100))
    events.append(Event(i, start, end))

concurrencies = Counter()
# events is a list sorted by start time
for i, event in enumerate(events):
    # look back: earlier events still running when this one starts
    j = i - 1
    while j >= 0 and event.start < events[j].end:
        concurrencies[event] += 1
        j -= 1
    # look forward: later events that start before this one ends
    j = i + 1
    while j < len(events) and events[j].start < event.end:
        concurrencies[event] += 1
        j += 1
# print events with highest number of concurrencies
for event in concurrencies.most_common(10):
    print(event)
print(f"Finished {len(events):,d} events in {time.time() - start_time:.1f} seconds")
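As an aside, if you only need the single moment of peak concurrency (rather than a per-event count), a sweep-line pass over the sorted start/end boundaries is a common alternative: each start adds one to a running count, each end subtracts one, and the maximum of that count is the peak. A sketch on a small hand-made set of intervals (the helper name and the sample intervals are mine, not from your data):

```python
from datetime import datetime, timedelta

def peak_concurrency(intervals):
    """Return (max simultaneous intervals, time at which that peak begins)."""
    deltas = []
    for start, end in intervals:
        deltas.append((start, 1))    # an interval opens
        deltas.append((end, -1))     # an interval closes
    # sort by time; at equal times the -1 sorts first, so an interval
    # ending exactly when another starts is not counted as overlapping
    deltas.sort()
    active = peak = 0
    peak_time = None
    for t, delta in deltas:
        active += delta
        if active > peak:
            peak, peak_time = active, t
    return peak, peak_time

base = datetime(2019, 1, 1)
intervals = [
    (base, base + timedelta(seconds=10)),                          # 0-10
    (base + timedelta(seconds=3), base + timedelta(seconds=7)),    # 3-7
    (base + timedelta(seconds=5), base + timedelta(seconds=6)),    # 5-6
    (base + timedelta(seconds=10), base + timedelta(seconds=12)),  # 10-12, touches but no overlap
]
print(peak_concurrency(intervals))  # peak of 3 during seconds 5-6
```

Sorting the 2n boundary points makes this O(n log n) overall, independent of how long individual events last.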

Let me know if any portions need clarifying.

import random
  • Thanks for your suggestions, it took some head scratching to move from 3.x to 2.x (I certainly now appreciate how streamlined 3.x is in comparison). I think I failed to completely describe the problem. I need to aggregate the entries in the following example: start1 end1 # nth record start2 end2 # where start2 >= start1 and start2 <= end1 while ignoring the sequence: start1 end1 # nth record start3 end3 # where start3 < start1 or start3 > end1 – user3073001 Feb 22 '19 at 17:35
  • And excuse my ignorance. Your algorithm works perfectly. I hope to be able to add this to my bag of tricks for future use. To paraphrase Mike Myers. . . https://www.youtube.com/watch?v=wIE_ITidaKY – user3073001 Feb 23 '19 at 01:41
  • @user3073001 The _sequence_ start3 < start1 will not occur if your events are sorted by start time. The approach above will only work if the events are sort by start time, as mentioned in the answer. Note that it doesn't check for equality, but it's simple to change `<` to `<=`. – import random Feb 24 '19 at 21:33