
I need to work out how many sessions are running at any given time, per minute, based on millions of rows of data like the ones below.

I have tried melting the dataframe and creating a new column equal to 1 or -1 depending on whether the timestamp is a start or an end. I hoped that grouping by time and summing would work.

The issue is that many sessions start the day before, and because I can only query today's data, I end up with minus 2 million in the early hours of the morning as those sessions end.
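
Roughly, my attempt looked like the sketch below (the column names are made up and the real table has millions of rows):

import pandas as pd

# Made-up column names, two rows in the shape of the data below
df = pd.DataFrame({'start': pd.to_datetime(['2020-05-31 00:01:01', '2020-05-31 00:02:01']),
                   'end':   pd.to_datetime(['2020-05-31 00:02:01', '2020-05-31 00:06:03'])})

# +1 at every session start, -1 at every session end
events = pd.concat([pd.Series(1, index=df['start']),
                    pd.Series(-1, index=df['end'])]).sort_index()

# Net change per minute, then a running total of open sessions.
# The running total goes negative once the ends of yesterday's sessions
# show up without their matching starts.
per_minute = events.groupby(events.index.floor('T')).sum().cumsum()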

Is there a good way to do this & chart it so it shows the number of sessions at any given time?

2020-05-31 00:00:01 | 2020-05-31 00:00:31   
2020-05-31 00:01:01 | 2020-05-31 00:02:01   
2020-05-31 00:02:01 | 2020-05-31 00:06:03   
2020-05-31 00:03:01 | 2020-05-31 00:04:01   
2020-05-31 00:04:01 | 2020-05-31 00:34:01   
— kikee1222

2 Answers


So here's an attempt that maybe someone can build off of (or that might bother someone enough to provide a better answer). Here's your data; I just added column names:

In[1]: df
Out[1]: 
       Session_Starts        Session_Ends
0 2020-05-31 00:00:01 2020-05-31 00:00:31
1 2020-05-31 00:01:01 2020-05-31 00:02:01
2 2020-05-31 00:02:01 2020-05-31 00:06:03
3 2020-05-31 00:03:01 2020-05-31 00:04:01
4 2020-05-31 00:04:01 2020-05-31 00:34:01

I round each session start and end down to the minute, and then make a date_range (with minute frequency) between those new start and end times. This gives, for each session, an array of the unique minutes during which it was active. I then unpack these arrays into a single Series and take value_counts().

import pandas as pd
import numpy as np
from itertools import chain

# Floor each start and end to the minute by dropping the seconds component
session_starts = (x - pd.Timedelta(seconds=x.second) for x in df['Session_Starts'])
session_ends = (x - pd.Timedelta(seconds=x.second) for x in df['Session_Ends'])

# One minute-frequency range per session, covering the minutes it was active
ranges = (pd.date_range(x, y, freq='1T') for x, y in zip(session_starts, session_ends))

# Flatten into a single Series and count how many sessions touch each minute
ranges = pd.Series(chain.from_iterable(ranges))
output = ranges.value_counts(sort=False).sort_index()

Output:

2020-05-31 00:00:00    1
2020-05-31 00:01:00    1
2020-05-31 00:02:00    2
2020-05-31 00:03:00    2
2020-05-31 00:04:00    3
2020-05-31 00:05:00    2
2020-05-31 00:06:00    2
2020-05-31 00:07:00    1
2020-05-31 00:08:00    1
...
2020-05-31 00:33:00    1
2020-05-31 00:34:00    1
dtype: int64

The problem is scale, i.e. the millions of observations you mentioned. I tried with toy data well under a million rows and it already starts to take a long time:

SIZE = 100000

# Random start times drawn from a one-day, minute-frequency grid
dr = pd.date_range(start='01-01-2020', end='1-02-2020', freq='1T')
col1 = np.random.choice(dr, SIZE)

# Random session lengths of 0-9 minutes
deltas = pd.Series([pd.Timedelta(minutes=r) for r in np.random.randint(0, 10, size=SIZE)])
col2 = col1 + deltas

df = pd.DataFrame({'Session_Starts': col1, 'Session_Ends': col2})

Timing with timeit, running this df through the same code above takes over 20 seconds, and the time appears to scale roughly linearly with the number of rows.
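
Roughly, the timing can be reproduced like this (the count_per_minute wrapper is mine and exists only so the whole pipeline can be timed in one call; its body is the same code as above):

import timeit
from itertools import chain

import pandas as pd

def count_per_minute(df):
    # Same logic as above, wrapped so it can be timed as a single call
    starts = (x - pd.Timedelta(seconds=x.second) for x in df['Session_Starts'])
    ends = (x - pd.Timedelta(seconds=x.second) for x in df['Session_Ends'])
    ranges = (pd.date_range(x, y, freq='1T') for x, y in zip(starts, ends))
    return pd.Series(chain.from_iterable(ranges)).value_counts(sort=False).sort_index()

# One run over the 100,000-row toy frame
print(timeit.timeit(lambda: count_per_minute(df), number=1))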

I'm not able to come up with anything better, but I'm sure there must be; I would be curious to see how it can be improved on (or simply a better solution). Hope this either helps or at least gets the ball rolling.

— Tom
  • Massive thank you for your response. I'll try to run this in Koalas on Spark & see if we can bring execution time down through parallelization :) – kikee1222 Jun 05 '20 at 13:22

My original approach was to create a DatetimeIndex that represents the time period which contains all of the events in the data and then for each event create an array, with the same dimension as the index, whose values are 1 or True when the event was taking place and 0 or False otherwise. Adding those arrays yields the concurrent events total at each time. A much better approach is to consider only the times in which new events started (+1) or ended (-1) and then take the cumulative sum of those changes. We can expand those results to the whole period containing the events by reindexing and filling.

Loading the data

import pandas as pd

# Data from the question
data = [['2020-05-31 00:00:01', '2020-05-31 00:00:31'],
        ['2020-05-31 00:01:01', '2020-05-31 00:02:01'],
        ['2020-05-31 00:02:01', '2020-05-31 00:06:03'],
        ['2020-05-31 00:03:01', '2020-05-31 00:04:01'],
        ['2020-05-31 00:04:01', '2020-05-31 00:34:01']]

# The data as a DataFrame
df = pd.DataFrame(data,  columns=['Start time', 'End time'], dtype='datetime64[ns]')

Creating the DatetimeIndex

It makes sense for the frequency to match the time granularity of the event timestamps.

min_time = df['Start time'].min()
max_time = df['End time'].max()
ts_index = pd.date_range(min_time, max_time, freq = 's')

Calculating concurrency

In the first two methods we create, for each event, an array with the same dimension as the index; these arrays indicate when the event took place. If there are a lot of events it is best to use an iterator, otherwise we risk running out of memory. The third method focuses on the starts and ends of the events rather than characterizing individual events throughout the whole period.

1. With a Series

There is no risk of running out of memory with this small example so we create a Series of arrays and add them.

# One 0/1 array per event marking the seconds it was active, summed across events
concurrency_array = df.apply(lambda e: ((ts_index >= e['Start time']) & (ts_index <= e['End time'])).astype(int),
                             axis='columns').sum()
concurrency = pd.Series(concurrency_array, index=ts_index)

2. With an iterator

This avoids having all of the arrays loaded in memory at once. Notice that here we use the built-in Python map and sum functions rather than pandas constructs.

# Lazily build one boolean array per event and add them with the built-in sum
concurrency_iter = map(lambda e: (ts_index >= e[0]) & (ts_index <= e[1]), df.values)
concurrency = pd.Series(sum(concurrency_iter), index=ts_index)

3. With a Series of only the changes (Best)

This approach is much faster than anything I could come up with and in general terms it is simply better. I got the idea from this answer.

Basically we create a Series with all of the start and end times for all the events, with the start times having a value of 1 and the end times having a value of -1. We then groupby the index values and sum, which yields a Series with all the changes (i.e. event starts, ends, and any combination of the two). Then we take the cumulative sum (cumsum) which yields the total concurrent events at the times they change, that is, at the times when at least one event starts or ends. To get the results for the whole period we simply reindex with the index we created previously and fill forward (ffill).

starts = pd.Series(1, df['Start time'])
ends = pd.Series(-1, df['End time'] + pd.Timedelta('1 sec')) # Include last second

concurrency_changes = pd.concat([starts, ends]) \
                      .groupby(level=0).sum() \
                      .cumsum()
concurrency = concurrency_changes.reindex(ts_index, method='ffill')

Result

The result of all of the methods above is a Series whose index is the DatetimeIndex we created previously and whose values are the total of concurrent events in our data.

Resampling

Now that we have a Series with the concurrency data we can resample at our convenience. For example if we are investigating the maximum utilization of a certain resource we might do something like this:

In [5]: concurrency.resample('5T').max()
Out[5]:
2020-05-31 00:00:00    3
2020-05-31 00:05:00    2
2020-05-31 00:10:00    1
2020-05-31 00:15:00    1
2020-05-31 00:20:00    1
2020-05-31 00:25:00    1
2020-05-31 00:30:00    1
Freq: 5T, dtype: int64
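
Since the question also asks about charting, here is a minimal sketch using pandas' built-in matplotlib plotting (this assumes matplotlib is installed; adjust the resampling to taste):

import matplotlib.pyplot as plt

# Plot the per-minute peak number of concurrent sessions
ax = concurrency.resample('1T').max().plot(figsize=(10, 4))
ax.set_xlabel('Time')
ax.set_ylabel('Concurrent sessions')
plt.show()
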
— dicristina