25

I have some data like this:

1: 2 - 10
2: 3 - 15
3: 4 - 9
4: 8 - 14
5: 7 - 13
6: 5 - 10
7: 11 - 15

I will attempt a representation to make it clearer:

        1     2     3     4     5     6     7     8     9     10     11     12     13     14     15
1             |--------------------------------------X---------|
2                   |--------------------------------X--------------------------------------------|
3                         |--------------------------X---|
4                                                  |-X-------------------------------------|
5                                           |--------X------------------------------|
6                               |--------------------X----------|
7                                                                     |---------------------------|

So in the example case, 8-9 is the critical period if the second scheme is used because all the points are active. What is a fast and good way to solving this problem in python? I am thinking of using dynamic programming but are there other approaches that are suggested?

My approach until now:

I was thinking more from a real-time perspective. So, whenever I get a new point, I do this: Assume I already got 2-10 and I get 3-15 then I pick the max of start and min of end so this case it is 3-10 and increment this interval's count to 2. Then the third point comes in 4-9, pick the max which is 4 and the min is 9 and update the value 3-10 to 4-9 and update count to 3. Now when 8-14 comes in, I pick the start of this interval is greater than 4-9 and the end of this interval is less than 4-9. In this case, it is not true so I will create a new bucket 8-14 and I put the count to 1. This is not the entire algorithm but should give a high-level idea of what I am doing here. I will see if I can sketch the pseudo-code.

Legend
  • 113,822
  • 119
  • 272
  • 400
  • Is there anything similar between the data? IE, are they shifts and will never be less than n or something like that? – Robert Apr 24 '11 at 04:33
  • 1
    Does this help: http://stackoverflow.com/questions/143552/comparing-date-ranges – sjr Apr 24 '11 at 04:34
  • 1
    @sjr: unless I'm misunderstanding that's not what he's trying to do. In that example it's given a span of time and you want to know everything that is active between the start and finish of one timeline. Sounds like he's wanting to know the busiest range of multiple timestamps with no single comparative source. – Robert Apr 24 '11 at 04:38
  • @Robert: Regarding your first point, I am not sure I understand your question properly but regarding the second point, that is correct. @sjr: Looking at it now. Thank you. – Legend Apr 24 '11 at 04:40
  • @Legend: Just wondering if there's anything known about the data. Are they shifts that will never be less than 4, or anything like that? – Robert Apr 24 '11 at 04:42
  • @Robert: Oh... To the best of my knowledge, I don't think so. – Legend Apr 24 '11 at 04:44
  • reminds me of http://stackoverflow.com/questions/4620273/combining-values-for-a-large-number-of-overlapping-intervals-of-dictionary-keys/4620897#4620897 the same idea with the heapq works here too, but this problem is actually simpler. – Jochen Ritzel Apr 24 '11 at 04:47
  • Ok, this might be good too: http://stackoverflow.com/questions/149577/need-an-algorithm-for-collapsing-netblock-ranges-into-lists-of-superset-ranges This allows you to collapse the overlapping ranges. You need to add a step at the end which is to find *which* range is shared amongst all of the overlapping ranges, which will be the max starting time and min finish time of all of the overlapping ranges. – sjr Apr 24 '11 at 04:48
  • 1
    Can you give more detail on what you've tried, and what parts you think might need improvement? Some basic work on your part is generally favored on SO. – Merlyn Morgan-Graham Apr 24 '11 at 04:52
  • 1
    @Merlyn Morgan-Graham: +1 for your point :) I agree that I should have posted my approach to adhere to the philosophy of SO. I have updated my question with the approach I had in mind but wasn't really sure about. I did not have pseudo-code ready so I was a little hesitant in putting that approach. – Legend Apr 24 '11 at 05:00

6 Answers6

27
        1     2     3     4     5     6     7     8     9     10     11     12     13     14     15
1             |--------------------------------------X---------|
2                   |--------------------------------X--------------------------------------------|
3                         |--------------------------X---|
4                                                  |-X-------------------------------------|
5                                           |--------X------------------------------|
6                               |--------------------X----------|
7                                                                     |---------------------------|

             +1    +1     +1   +1           +1     +1    -1    -2     +1           -1     -1     -2
              1     2     3     4           5       6    5      3     4             3      2      0
                                                     ^^^^

Get it?

So you need to transform this:

1: 2 - 10
2: 3 - 15
3: 4 - 9
4: 8 - 14
5: 7 - 13
6: 5 - 10
7: 11 - 15

into:

[(2,+), (3,+), (4,+), (5,+), (7,+), (8,+), (9,-), (10,-), (10,-), (11,+), (13,-), (14,-), (15,-), (15,-)]

and then you simply iterate through, counting up when you see a + and counting down on -. The busiest interval will be when the count is maximum.

So in code:

intervals = [(2, 10), (3, 15), (4, 9), (8, 14), (7, 13), (5, 10), (11, 15)]
intqueue = sorted([(x[0], +1) for x in intervals] + [(x[1], -1) for x in intervals])
rsum = [(0,0)]
for x in intqueue: 
    rsum.append((x[0], rsum[-1][1] + x[1]))
busiest_start = max(rsum, key=lambda x: x[1])
# busiest_end = the next element in rsum after busiest_start 

# instead of using lambda, alternatively you can do:
#     def second_element(x):
#         return x[1]
#     busiest_start = max(rsum, key=second_element)
# or:
#     import operator
#     busiest_start = max(rsum, key=operator.itemgetter(1))

runtime complexity is (n+n)*log(n+n)+n+n or O(n*log(n))

It is also possible to convert this idea into an online algorithm if you don't have the complete list of intervals at the start of the program but is guaranteed that incoming intervals will never be scheduled for a past point. Instead of sorting you will use a priority queue, each time an interval comes, you push in two items, the start point and the end point, each with a +1 and -1 respectively. And then you pop off and count and keep track of the peak hour.

Lie Ryan
  • 62,238
  • 13
  • 100
  • 144
  • @Lie Ryan: +1 Beautiful implementation! May be the next part is no related to the question but I'd appreciate your response on this. I think I have understood the entire code as a whole but can you please explain what exactly is being done in the last step? I know the output but what is the lambda really doing? Even better if you can just show me a non-lambda snippet as well along with it, I will fully understand the solution. Accepted as answer. Thank you. – Legend Apr 24 '11 at 05:32
  • Does it handle discontinuities? – sjr Apr 24 '11 at 05:35
  • @Legend: lambda is just an alternative syntax for function definition, see [this](http://diveintopython.org/power_of_introspection/lambda_functions.html). In `max(rsum, key=lambda x: x[1])`, this line is simply searching the element `(ai,bi)` in the array `[(a1,b1), ..., (an,bn)]` where the second element `bi` is maximum. – Lie Ryan Apr 24 '11 at 05:38
  • @sjr: between two non-overlapping subsets of intervals (I assume that's what you meant by "gap"), the counter should simply be 0 so the algorithm should still work, though I haven't tested it. – Lie Ryan Apr 24 '11 at 05:41
  • @sjr: If I understand you correctly, it seems to be handling it. I tried `intervals = [(2, 10), (3, 15), (4, 9), (8, 14), (7, 13), (5, 10), (11, 15), (4,7), (5,8), (17,20), (18,21), (16,24), (16, 22), (17, 20), (18,20), (16,20)]` – Legend Apr 24 '11 at 05:42
  • @Lie Ryan: Thank you for the non-lambda approach. – Legend Apr 24 '11 at 06:22
  • When applied to a list of pairs, does `sorted()` sort first by the first element and then by the second? That would seem to be necessary to guarantee that an interval that ends at the same place another interval starts will cause the "open interval count" to decrease before it increases. – j_random_hacker Apr 27 '11 at 02:07
  • 1
    @j_random_hacker: yes, in Python, tuples are compared lexicographically so `(5, -1) < (5, +1)` – Lie Ryan Apr 27 '11 at 13:19
6

I would start by thinking of the busy-ness of a point x as the number of activations to the left of x, minus the number of deactivations to the left of x. I would sort the activations and deactivations by the time at which they occur (in O(nlog(n)) time). Then you can traverse the list, tracking the number active (y), incrementing and decrementing that number with activations and deactivations passed. The busiest period will be the points at which y is at its maximum. I can't think of a solution off the top of my head that is better than O(nlog(n)). The brute force would be O(n^2).

Eric Mickelsen
  • 10,309
  • 2
  • 30
  • 41
  • R.K's solution looks like it is O(n) to me as long as he's correct in assuming we can pick a small set of discrete buckets. – Michael Greene Apr 24 '11 at 05:01
  • 1
    +1, simple and fast. Make sure that deactivations sort "less than" activations though, so that whenever one segment ends at the same position as another starts, the count of active segments will go down before it goes up. – j_random_hacker Apr 24 '11 at 05:01
  • @Michael: R.K's solution is not O(n) because each segment potentially needs to increment every bucket's count. It's O(nm) time, where m is the maximum endpoint of any segment, and it requires O(m) space. – j_random_hacker Apr 24 '11 at 05:04
  • @Michael Greene: R.K's solution is O(m*n) in the number of buckets and number of actors, not O(n). @j_random_hacker: Thanks, and an excellent suggestion. – Eric Mickelsen Apr 24 '11 at 05:05
  • 1
    O(m*n) where m is a constant is still O(n), and the way I read the problem m would be a small constant, such as 24. I now see this problem could include growing buckets in which case you are correct. – Michael Greene Apr 24 '11 at 05:12
  • @Eric Mickelsen: +1 Thank You for the logic. Very interesting. – Legend Apr 24 '11 at 05:33
  • @Michael: You're right, for fixed m an O(mn) solution is O(n), so it would be fine for e.g. hours in a day. But as you say, "fixed m" might not be a realistic assumption here. – j_random_hacker Apr 27 '11 at 02:00
4

I thought you could perhaps use a set() for this, and it would work if your assured that all periods intersect at at least one point.

However, this does not work as soon as a period does not intersect. You may be able to add additional logic to cover this, so I'll post what I was thinking:

>>> periods = [(2, 10), (3, 15), (4, 9), (8, 14), (7, 13), (5, 10),]
>>> intersected = None
>>> for first, second in periods:
...     if not intersected:
...         intersected = set(range(first, second + 1))
...     else:
...         intersected = intersected.intersection(set(range(first, second + 1)))
...
>>> intersected
set([8, 9])

Note: this does not include the 11-15 period. Your probably best off just creating bin pairs as mentioned by R.K.

monkut
  • 42,176
  • 24
  • 124
  • 155
  • Interesting. I updated my question with some approach I was thinking. May be I can integrate both. – Legend Apr 24 '11 at 05:01
4

Here's what I was thinking for the bin based approach, and adapted to handle adds dynamically, basically what R.K. was saying I believe.

from collections import defaultdict
from operator import itemgetter

class BusyHour(object):
    def __init__(self):
        self.pairs = defaultdict(int)
    def add_period(self, period):
        start, end = period
        for current_period in range(start, end):
            pair_key = (current_period, current_period + 1) 
            self.pairs[pair_key] += 1
    def get_max(self):
        # sort, defaults to smallest to largest
        # --> items() returns (key, value) pairs
        # --> itemgetter gets the given index of the first argument given to sorted
        return max(self.pairs.items(), key=itemgetter(1))


if __name__ == '__main__':
    periods = [(2, 10), (3, 15), (4, 9), (8, 14), (7, 13), (5, 10), (11, 15)]
    bh = BusyHour()
    for period in periods:
        bh.add_period(period)
    print bh.get_max()

Updated: Only sort on call to get_max, and use defaultdict(int).

Lie Ryan
  • 62,238
  • 13
  • 100
  • 144
monkut
  • 42,176
  • 24
  • 124
  • 155
  • Legend, no problem! It was a good excercise for me to recall how to use sorted(), itemgetter(), and defaultdict. I didn't realize until I did this that max() and min() had the "key" argument. – monkut Apr 25 '11 at 01:40
3

Not sure if i understand your question. If you are trying to find the most common "interval", you can sum them up per interval. This way you have 12 buckets for the above example. For each use, you would add 1 to each of the buckets used in that particular use, and at the end, find the maximum value in all the buckets. Here, that would be 6 for the 8-9 interval.

R.K
  • 153
  • 1
  • 7
  • : +1 Thank You. While this works, I am not sure if the approach is scalable but please correct me if I am wrong. The example I have given is a toy example but in reality, these numbers can be huge. If there is no better way of doing this, then yes, I think what you suggested does look like a potential way of doing this. – Legend Apr 24 '11 at 04:46
0

I've put together a small C++ program if you want to have a linear performance here. I know it is not Python, but the idea is very simple here.

We first create an array with all points and increment the item in the array if the interval starts at that index and decrement it if it ends on that index.

Once the array is constructed, we just iterate over and calculate where we had the maximum number of open intervals.

Time complexity is O(M + N)

Space complexity is O(N)

Where M is the number of intervals and N is the max value from the interval pairs.

#include <iostream>
#include <vector>

int maxLoad(const std::vector<std::pair<int, int>>& intervals) {
    std::vector<int> points;
    for(const auto& interval : intervals) {
        if(interval.second >= points.size()) points.resize(interval.second + 1);
        ++points[interval.first];
        --points[interval.second];
    }

    int ans = 0;
    int sum = 0;
    for(const auto point : points) {
        sum += point;
        ans = std::max(ans, sum);
    }
    return ans;
}

int main() {
    std::vector<std::pair<int, int>> intervals {
        {2, 10}, {3, 15}, {4, 9}, {8, 14}, {7, 13}, {5, 10}, {11, 15}
    };
    std::cout << maxLoad(intervals) << std::endl;
}

Hovhannes Grigoryan
  • 1,151
  • 1
  • 8
  • 11