
I need a fast approach to add rows for all the dates between two columns. I have already tried the following approaches:

df.index.repeat: add rows for all dates between two columns?

pd.melt: How can I add rows for all dates between two columns?

date_range + merge: Get all dates between startdate and enddate columns

itertuples: Add rows to dataframe each day between a range of two columns

SQL script using SQLite: crashed the machine when trying to do a join between two dates.

When they worked, they were all very slow (between 25 and 75 minutes). I have 1.2 million rows in my dataframe, which expands to over 30 million. Is there any way to do any of these solutions using something like apply or vectorization? Is there a way of doing this in just a few minutes, max?

Data has following structure:

start_date   end_date     ID

The final structure should be:

start_date   end_date     Date     ID
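For example, with made-up values, a single input row should expand like this:

```python
import pandas as pd

# Hypothetical input row (illustrative values only)
row = pd.DataFrame({
    'start_date': pd.to_datetime(['2020-01-01']),
    'end_date': pd.to_datetime(['2020-01-03']),
    'ID': [7],
})

# Desired result: one row per calendar day between start_date and end_date
desired = pd.DataFrame({
    'start_date': pd.to_datetime(['2020-01-01'] * 3),
    'end_date': pd.to_datetime(['2020-01-03'] * 3),
    'Date': pd.date_range('2020-01-01', '2020-01-03'),
    'ID': [7, 7, 7],
})
print(desired)
```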
MartP
  • Do you mind posting `df["start_date"].min()` and `df["end_date"].max()`? I'll need them to create a reproducible example – rpanai Jan 09 '20 at 14:24
    Can you share enough code/data for us to be able to test any potential solutions? – AMC Jan 09 '20 at 14:54

1 Answer


Calendrical calculation (pd.date_range) is a lot slower than generating a sequence of integers (np.arange). The answers you listed are pretty simple & elegant, but they are not very fast. Your volume of data calls for a different solution.
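As a rough, machine-dependent illustration of that gap, compare materializing ~10 years of daily timestamps against generating the equivalent integer day offsets:

```python
import timeit

import numpy as np
import pandas as pd

n = 3653  # roughly 10 years of daily dates

# Calendrical: build actual Timestamp values
t_dates = timeit.timeit(
    lambda: pd.date_range('2010-01-01', periods=n), number=200)

# Integer: just a sequence of day offsets
t_ints = timeit.timeit(
    lambda: np.arange(n), number=200)

print(f'date_range: {t_dates:.4f}s  arange: {t_ints:.4f}s')
```

The exact numbers vary by machine, but `np.arange` is consistently orders of magnitude cheaper, which is what the approach below exploits.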

This answer assumes that ID is unique for each row. It took about 4.5 seconds for 1M rows on my 2015 iMac. The final dataframe has ~25M rows.

import itertools

import numpy as np
import pandas as pd

# `date_range` is slow so we only call it once
all_dates = pd.date_range(df['start_date'].min(), df['end_date'].max())

# For each day in the range, number them as 0, 1, 2, 3, ...
rank = all_dates.to_series().rank().astype(np.int64) - 1

# Change from `2020-01-01` to "day 500 in the all_dates array", for example
start = df['start_date'].map(rank).values
end = df['end_date'].map(rank).values

# This is where the magic happens. For each row, instead of saying
# `start_date = Jan 1, 2020` and `end_date = Jan 10, 2020`, we are
# creating a range of days: [500, 501, ... 509]
indices = list(itertools.chain.from_iterable([range(s,e+1) for s,e in zip(start, end)]))

# Now map day 500 back to Jan 1, 2020, day 501 back to Jan 2, 2020, and so on
dates = np.take(all_dates, indices)

# Align the rest of the columns to the expanded dates
duration = (end - start + 1).astype(np.int64)
ids = np.repeat(df['ID'].values, duration)  # .values avoids duplicate-index alignment
start_date = np.repeat(df['start_date'].values, duration)
end_date = np.repeat(df['end_date'].values, duration)

# Assemble the result
result = pd.DataFrame({
    'start_date': start_date,
    'end_date': end_date,
    'ID': ids,
    'Date': dates
})
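Packaged as a function over plain positional arrays (a sketch along the same lines as the answer above, not a verbatim copy), the whole pipeline can be sanity-checked on a toy frame:

```python
import itertools

import numpy as np
import pandas as pd

def expand_dates(df):
    """One output row per calendar day between start_date and end_date, inclusive."""
    all_dates = pd.date_range(df['start_date'].min(), df['end_date'].max())
    rank = all_dates.to_series().rank().astype(np.int64) - 1
    start = df['start_date'].map(rank).values
    end = df['end_date'].map(rank).values
    indices = list(itertools.chain.from_iterable(
        range(s, e + 1) for s, e in zip(start, end)))
    duration = end - start + 1
    return pd.DataFrame({
        'start_date': np.repeat(df['start_date'].values, duration),
        'end_date': np.repeat(df['end_date'].values, duration),
        'Date': np.take(all_dates, indices),
        'ID': np.repeat(df['ID'].values, duration),
    })

# Toy frame with hypothetical values: ID 1 spans 3 days, ID 2 spans 2 days
toy = pd.DataFrame({
    'start_date': pd.to_datetime(['2020-01-01', '2020-01-03']),
    'end_date': pd.to_datetime(['2020-01-03', '2020-01-04']),
    'ID': [1, 2],
})
result = expand_dates(toy)
print(result)  # 3 + 2 = 5 rows
```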
Code Different
  • This did not work saying that the "ValueError: array length 300527269 does not match index length 301745242". I checked and the ID is not unique as I had thought. – MartP Jan 14 '20 at 08:50
  • I may have introduced a bug during optimization. Try changing `range(s,e)` to `range(s,e+1)` – Code Different Jan 14 '20 at 13:49