0

I have some (very large) list of tuples that come from a database containing an id, start_time, and end_time

I also have a list of times in a regular interval and ordered (these are all datetime objects).

I basically need to loop through these times and find all the tuples in which the time falls within their range.

I'm wondering what the most efficient way to do this is. The first idea that comes to mind is like this (pseudo-code):

for time in times:
    for tuple in tuples:
        if tuple.start_time <= time <= tuple.end_time:
            # add tuple to some_other_list
        if tuple.end_time < time
            # remove tuple from tuples

The reason I'd do that is to iterate over a smaller and smaller list, hoping to cut some time there, however I am open to completely different approaches as well. I suppose another idea would be to just query the db with the given time each iteration but I'd imagine the latency there would far outweigh just having the full dataset in memory and working with it that way.

So for example, I would have a list of tuples where each tuple would look like:

[('783', datetime.datetime(2017, 12, 31, 20, 49, 28), datetime.datetime(2017, 12, 31, 23, 49, 28)), ('5274', datetime.datetime(2017, 12, 31, 20, 49, 45), datetime.datetime(2018, 1, 1, 0, 0)), ('757', datetime.datetime(2017, 12, 31, 20, 50, 25), datetime.datetime(2018, 1, 1, 1, 50, 25)), ('5600', datetime.datetime(2017, 12, 31, 20, 50, 59), datetime.datetime(2017, 12, 31, 23, 39)), ('5176', datetime.datetime(2017, 12, 31, 20, 51, 23), datetime.datetime(2018, 1, 1, 1, 51, 23)), ('5323', datetime.datetime(2017, 12, 31, 20, 52, 39), datetime.datetime(2018, 1, 1, 0, 0)), ('464', datetime.datetime(2017, 12, 31, 20, 52, 41), datetime.datetime(2018, 1, 1, 0, 52, 41))]

And the list of times would be stored in a generator basically using this answer, so looping through them would produce something like this:

2017-12-15 00:00:00
2017-12-22 00:00:00
2017-12-29 00:00:00
2018-01-05 00:00:00
2018-01-12 00:00:00
2018-01-19 00:00:00

And the actual output I'm fairly agnostic to, it would just be some dictionary of

{'2017-12-15 00:00:00': [list of matching ids], '2017-12-22 00:00:00': [list of matching ids], ...}

Any ideas or advice would be appreciated!

ShaneOH
  • 1,454
  • 1
  • 17
  • 29
  • Can you provide some example data with desired output? – jpp Mar 24 '18 at 14:28
  • It would just be something like `{1/1/2017 9am (maybe unix timestamp): [list of tuples where tuple.start_time <= 1/1/2017 9am <= tuple.end_time, '1/1/2017 10am: [list...], etc}` Where all 3 times (i.e. the data I'm working with, not necessarily the output) are datetime objects in this case – ShaneOH Mar 24 '18 at 14:33
  • OK. I recommend you add some data to your post *which people can copy/paste*. I know it sounds trivial, but this is the kind of thing which encourages more people to answer. – jpp Mar 24 '18 at 14:35
  • Sure thing, I edited to add some data, hopefully that helps! – ShaneOH Mar 24 '18 at 14:49
  • Just to be sure: you don't have so many times (or so many long, overlapping intervals) that the resulting all-at-once dictionary will take up too much memory, right? – Davis Herring Mar 24 '18 at 21:01

1 Answers1

1

First, a note about removing irrelevant intervals: if you do that from a (long) list, the performance will be terrible because of the need to shift later elements into the empty space. It's possible to work around this by instead replacing deleted elements with an integer that says how far to skip to find the next real data.

This is the classic interval-query problem, for which the usual answer is an interval or segment tree. However, if you can store all the results (and hence all the sorted query times) at once, a simple alternative is available: instead of iterating over the times and then searching the intervals, iterate once over all the intervals and perform binary searches to find the earliest and latest query times contained by each interval. Then append the interval's ID to a list maintained for each such time:

def ids(iv,tm):
  ret=[[] for _ in tm]
  for nm,l,h in iv:
    for i in range(bisect.bisect_left(tm,l),bisect.bisect_right(tm,h)):
      ret[i].append(nm)
  return ret

You can of course build a dictionary from the result with dict(zip(tm,ids(iv,tm))).

Davis Herring
  • 36,443
  • 4
  • 48
  • 76
  • Hey, thanks for the information. I had the "ah-ha" moment of switching to iterating over the times first (so swapping iteration order) shortly after posting the question, and your point of binary searching was a further improvement. This seems to work well, so thanks again! (one note of caution for anyone potentially reading in the future is to check for `bisect_left == -1` as this could produce unexpected results, depending on your use case) – ShaneOH Mar 26 '18 at 06:53
  • ah my mistake, that was a bug in my code! should have realized - yes to clarify, it does not return -1 – ShaneOH Mar 27 '18 at 11:51