I have a Pandas dataframe containing two datetime columns: `start` and `end`. I would like to transform it into another dataframe (or series) where each row is a point in time (at whatever frequency; let's say daily for now) and the values are the number of rows in the original dataframe whose interval contains that point in time. What is the most "efficient" way to do this (in time and/or memory) with pandas?
A concrete example would be an input data frame like the following (the values here are illustrative):
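```python
import pandas as pd

df = pd.DataFrame({
    'start': pd.to_datetime(['2022-01-01', '2022-01-02', '2022-01-04']),
    'end':   pd.to_datetime(['2022-01-03', '2022-01-05', '2022-01-04']),
})
```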
With the following result:
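For the illustrative input above, the result would be something like:

```
2022-01-01    1
2022-01-02    2
2022-01-03    2
2022-01-04    2
2022-01-05    1
```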
My current solution is the following:
```python
df.apply(lambda x: pd.date_range(x['start'], x['end']), axis=1).explode().value_counts().sort_index()
```
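Spelled out step by step (same logic, just split up so the intermediate objects are visible; `ranges`, `exploded`, and `counts` are names I'm introducing here):

```python
# One DatetimeIndex per row, covering start..end inclusive
ranges = df.apply(lambda x: pd.date_range(x['start'], x['end']), axis=1)

# One entry per (original row, day) pair -- this is the big intermediate
exploded = ranges.explode()

# Number of original rows covering each day, in chronological order
counts = exploded.value_counts().sort_index()
```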
This works as expected. However, my real use case has over 1.5 million records, with each interval about 180 days long on average, so after the `.explode()` step I have a series with roughly 280 million entries, which eats up more memory than this problem probably needs. I am also wondering if I can improve speed by avoiding building such a big series only to shrink it back down. This approach also has the disadvantage that any days between the minimum start date and maximum end date that fall inside none of the intervals simply do not appear in the result (instead of appearing with a count of 0, which would be better).
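For completeness: assuming `counts` is the series produced above, the missing-days issue by itself could be patched with a reindex over the full date range, though that does nothing for the memory problem:

```python
# Fill days that no interval covers with 0 (assumes `counts` from above)
full_range = pd.date_range(df['start'].min(), df['end'].max(), freq='D')
counts_full = counts.reindex(full_range, fill_value=0)
```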
The only other option I can think of is to iterate over a range from the earliest start to the latest end, something like `[((df['start'] <= date) & (df['end'] >= date)).sum() for date in pd.date_range(df['start'].min(), df['end'].max())]` (see the sketch below). But I tend to try to avoid iteration in general.
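As a runnable sketch of that loop-based idea (`days` and `daily_counts` are names I'm introducing here):

```python
# Loop-based alternative: for each day, count the intervals containing it.
# O(days * rows) comparisons, but only one scalar per day is kept in memory.
days = pd.date_range(df['start'].min(), df['end'].max(), freq='D')
daily_counts = pd.Series(
    [((df['start'] <= day) & (df['end'] >= day)).sum() for day in days],
    index=days,
)
```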
What is the most efficient approach?
This question may look like a duplicate of the one here, since its accepted solution includes a possible answer here as a subroutine; but that subroutine is not the focus of that question, and there is no discussion of how the solution works or how efficient it is.