
I have a Pandas dataframe containing two datetime columns: start and end. I would like to transform it into another dataframe (or series) where each row is a point in time (at whatever frequency, let's just say day for now), and the values are the number of rows in the original dataframe whose interval contains that point in time. What is the most "efficient" way to do this (by time and/or memory) with pandas?

A concrete example would be the following input data frame:

(image: example input dataframe with start and end columns)

With the following result:

(image: resulting series of daily counts)
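Since the images are not reproduced here, a small made-up example (not the original data) of the expected shape of the input and output would be:

import pandas as pd

# Hypothetical input: each row is an interval [start, end]
df = pd.DataFrame({
    'start': pd.to_datetime(['2022-01-01', '2022-01-02', '2022-01-04']),
    'end':   pd.to_datetime(['2022-01-03', '2022-01-05', '2022-01-04']),
})

# Desired output: for each day, the number of intervals covering it, e.g.
# 2022-01-01    1
# 2022-01-02    2
# 2022-01-03    2
# 2022-01-04    2
# 2022-01-05    1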

My current solution is the following:

df.apply(lambda x: pd.date_range(x['start'], x['end']), axis = 1).explode().value_counts().sort_index()

This works as expected. However, my real use case has over 1.5 million records, with each interval about 180 days long on average, so after the .explode step I have a series with 280 million entries, which eats up more memory than is probably necessary for this problem. I am also wondering if I can improve speed by avoiding creating such a big series before shrinking it back down. It also has the disadvantage that any dates between the minimum start date and maximum end date that fall in none of the intervals do not appear in the result at all (instead of appearing with a value of 0, which would be better).

The only other option I can think of is to iterate over a range from the earliest start to the latest end, something like [((df['start'] <= date) & (df['end'] >= date)).sum() for date in pd.date_range(df['start'].min(), df['end'].max())]. But I tend to avoid iteration in general.
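Written out in full, that loop might look like the sketch below (assuming the containment condition start <= date <= end, and wrapping the result in a Series so that days covered by no interval still appear with a count of 0):

import pandas as pd

# For each day in the overall range, count the rows whose interval contains it.
days = pd.date_range(df['start'].min(), df['end'].max())
counts = pd.Series(
    [((df['start'] <= day) & (df['end'] >= day)).sum() for day in days],
    index=days,
)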

What is the most efficient approach?

This question may be a possible duplicate of here, since the accepted solution there includes a possible answer to my question (here) as a subroutine, but that subroutine is not the focus of that question, and there is no discussion of how the solution works or how efficient it is.

Ike348

1 Answer


I would suggest slicing the data into multiple files instead of pushing it all through in one go, which is a bit excessive for this problem. Use Bash first to split the file into several smaller ones, then feed them into a dataframe one at a time.

split -l <insert number of lines here> <filename>

Gather the resulting files into a folder you can read from, then perform the same process on each one.
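A rough sketch of that idea (the chunks/ folder, CSV format, and column names here are assumptions for illustration, not tested code):

import glob
import pandas as pd

total = None
# Process each split file separately and accumulate the per-day counts.
for path in glob.glob('chunks/*.csv'):  # assumed location of the split files
    chunk = pd.read_csv(path, parse_dates=['start', 'end'])
    counts = (chunk.apply(lambda x: pd.date_range(x['start'], x['end']), axis=1)
                   .explode()
                   .value_counts())
    total = counts if total is None else total.add(counts, fill_value=0)

result = total.sort_index()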

Hope this helps!

Ehab