My original approach was to create a DatetimeIndex that represents the time period containing all of the events in the data, and then, for each event, to create an array with the same dimension as the index whose values are 1 (or True) while the event was taking place and 0 (or False) otherwise. Adding those arrays yields the total number of concurrent events at each time. A much better approach is to consider only the times at which events start (+1) or end (-1) and then take the cumulative sum of those changes. We can expand those results to the whole period containing the events by reindexing and filling.
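To make the changes-and-cumulative-sum idea concrete before looking at the real data, here is a minimal sketch with made-up intervals (toy values, not the question's data): each start contributes +1, each end contributes -1, and the running total is the concurrency at each change point.
import pandas as pd
# Toy illustration: event A runs 00:00-00:20, event B runs 00:10-00:30
changes = pd.Series(
    [1, 1, -1, -1],
    index=pd.to_datetime(['2020-05-31 00:00:00',   # A starts -> +1
                          '2020-05-31 00:10:00',   # B starts -> +1
                          '2020-05-31 00:20:00',   # A ends   -> -1
                          '2020-05-31 00:30:00'])) # B ends   -> -1
print(changes.cumsum())   # 1, 2, 1, 0 -> two events overlap between 00:10 and 00:20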
Loading the data
import pandas as pd
# Data from the question
data = [['2020-05-31 00:00:01', '2020-05-31 00:00:31'],
        ['2020-05-31 00:01:01', '2020-05-31 00:02:01'],
        ['2020-05-31 00:02:01', '2020-05-31 00:06:03'],
        ['2020-05-31 00:03:01', '2020-05-31 00:04:01'],
        ['2020-05-31 00:04:01', '2020-05-31 00:34:01']]
# The data as a DataFrame
df = pd.DataFrame(data, columns=['Start time', 'End time'], dtype='datetime64[ns]')
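If the timestamp columns come in as plain strings instead (for example when reading the data from a CSV file), the same conversion can be done explicitly; a short sketch:
# Convert string columns to datetimes explicitly (only needed if they are not already datetime64)
df['Start time'] = pd.to_datetime(df['Start time'])
df['End time'] = pd.to_datetime(df['End time'])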
Creating the DatetimeIndex
It makes sense for the frequency to match the time granularity of the event timestamps.
min_time = df['Start time'].min()
max_time = df['End time'].max()
ts_index = pd.date_range(min_time, max_time, freq='s')
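As a quick sanity check, for the sample data the index should run from the earliest start to the latest end at one-second steps:
print(ts_index[0], ts_index[-1], len(ts_index))
# 2020-05-31 00:00:01  2020-05-31 00:34:01  2041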
Calculating concurrency
In the first two methods we create, for every event, an array with the same dimension as the index; these arrays indicate when the event took place. If there are a lot of events, it is best to use an iterator, otherwise we risk running out of memory. The third method focuses only on the starts and ends of the events rather than characterizing each individual event throughout the whole period.
1. With a Series
There is no risk of running out of memory with this small example, so we create a Series of arrays and add them.
concurrency_array = df.apply(
    lambda e: ((ts_index >= e['Start time']) & (ts_index <= e['End time'])).astype(int),
    axis='columns').sum()
concurrency = pd.Series(concurrency_array, index=ts_index)
2. With an iterator
This will avoid having all of the arrays loaded in memory at once. Notice that here we use the Python map and sum functions rather than pandas constructs.
concurrency_iter = map(lambda e: (ts_index >= e[0]) & (ts_index <= e[1]), df.values)
concurrency = pd.Series(sum(concurrency_iter), index = ts_index)
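An equivalent formulation with a generator expression instead of map is sketched below; either way, only one boolean array is materialized at a time.
# Same idea with a generator expression: arrays are produced and consumed one at a time
concurrency = pd.Series(
    sum((ts_index >= start) & (ts_index <= end) for start, end in df.values),
    index=ts_index)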
3. With a Series of only the changes (Best)
This approach is much faster than anything I could come up with myself, and in general terms it is simply better. I got the idea from this answer.
Basically we create a Series with all of the start and end times for all the events, with the start times having a value of 1 and the end times having a value of -1. We then groupby the index values and sum, which yields a Series with all the changes (i.e. event starts, ends, and any combination of the two). Then we take the cumulative sum (cumsum), which yields the total of concurrent events at the times they change, that is, at the times when at least one event starts or ends. To get the results for the whole period we simply reindex with the index we created previously and fill forward (ffill).
starts = pd.Series(1, df['Start time'])
ends = pd.Series(-1, df['End time'] + pd.Timedelta('1 sec')) # Include last second
concurrency_changes = pd.concat([starts, ends]) \
                        .groupby(level=0).sum() \
                        .cumsum()
concurrency = concurrency_changes.reindex(ts_index, method='ffill')
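To see the intermediate result before reindexing, you can inspect concurrency_changes; for the sample data it has one entry per timestamp at which the count changes, peaking at 3 at 00:04:01, the one second during which three events overlap.
# Inspect the change points only (before expanding to the full per-second index)
print(concurrency_changes)
print(concurrency_changes.max())   # 3 for the sample data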
Result
The result of all of the methods above is a Series whose index is the DatetimeIndex we created previously and whose values are the total of concurrent events in our data.
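The time and size of the peak, for example, can be read directly off this Series; with the sample data the maximum should be 3 concurrent events at 00:04:01.
print(concurrency.max())      # 3
print(concurrency.idxmax())   # 2020-05-31 00:04:01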
Resampling
Now that we have a Series with the concurrency data, we can resample it at our convenience. For example, if we are investigating the maximum utilization of a certain resource, we might do something like this:
In [5]: concurrency.resample('5T').max()
Out[5]:
2020-05-31 00:00:00 3
2020-05-31 00:05:00 2
2020-05-31 00:10:00 1
2020-05-31 00:15:00 1
2020-05-31 00:20:00 1
2020-05-31 00:25:00 1
2020-05-31 00:30:00 1
Freq: 5T, dtype: int64
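The same Series can be aggregated in other ways as well; for example, a sketch of the average rather than the peak concurrency per 5-minute window:
concurrency.resample('5T').mean()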