
I have the following Pandas dataframe:

shop   item   price   start_date_valid  end_date_valid
shop1  item1  100.00  2019-01-01        2019-01-06
shop1  item1  120.00  2019-01-07        2019-01-10
shop1  item2   80.00  2018-12-31        2019-01-03
shop1  item2  150.00  2019-01-04        2019-01-06
shop2  item1  105.00  2019-01-01        2019-01-06
shop2  item1  130.00  2019-01-07        2019-01-10
shop2  item2   85.00  2018-12-31        2019-01-03
shop2  item2  145.00  2019-01-04        2019-01-06

I need to get the following table as output:

shop   item   date        price
shop1  item1  2019-01-01  100.00
shop1  item1  2019-01-02  100.00
shop1  item1  2019-01-03  100.00
shop1  item1  2019-01-04  100.00
shop1  item1  2019-01-05  100.00
shop1  item1  2019-01-06  100.00
shop1  item1  2019-01-07  120.00
shop1  item1  2019-01-08  120.00
shop1  item1  2019-01-09  120.00
shop1  item1  2019-01-10  120.00
shop1  item2  2018-12-31   80.00
shop1  item2  2019-01-01   80.00
shop1  item2  2019-01-02   80.00
shop1  item2  2019-01-03   80.00
shop1  item2  2019-01-04  150.00
shop1  item2  2019-01-05  150.00
shop1  item2  2019-01-06  150.00

for all shop/item combinations. The problem is that the source dataframe is very big (about 100 million rows), so a simple `df.groupby(['shop', 'item']).apply(...)` looks unsuitable to me. How can I produce the desired output in the fastest way possible?
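For reference, a minimal sketch of how the sample data can be rebuilt and of the groupby/apply baseline I would like to avoid (the construction and the `expand` helper are only illustrative):

import pandas as pd

# Small reproduction of the source dataframe (illustrative only).
df = pd.DataFrame({
    'shop': ['shop1', 'shop1', 'shop1', 'shop1', 'shop2', 'shop2', 'shop2', 'shop2'],
    'item': ['item1', 'item1', 'item2', 'item2', 'item1', 'item1', 'item2', 'item2'],
    'price': [100.00, 120.00, 80.00, 150.00, 105.00, 130.00, 85.00, 145.00],
    'start_date_valid': pd.to_datetime(['2019-01-01', '2019-01-07', '2018-12-31', '2019-01-04',
                                        '2019-01-01', '2019-01-07', '2018-12-31', '2019-01-04']),
    'end_date_valid': pd.to_datetime(['2019-01-06', '2019-01-10', '2019-01-03', '2019-01-06',
                                      '2019-01-06', '2019-01-10', '2019-01-03', '2019-01-06']),
})

# Baseline: expand every row to one row per day inside groupby().apply().
# Works, but is far too slow at ~100 million input rows.
def expand(group):
    out = []
    for _, row in group.iterrows():
        for date in pd.date_range(row['start_date_valid'], row['end_date_valid']):
            out.append({'shop': row['shop'], 'item': row['item'],
                        'date': date, 'price': row['price']})
    return pd.DataFrame(out)

slow = df.groupby(['shop', 'item'], group_keys=False).apply(expand).reset_index(drop=True)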

  • Does this answer your question? [Expand pandas dataframe date ranges to individual rows](https://stackoverflow.com/questions/60039948/expand-pandas-dataframe-date-ranges-to-individual-rows) – AlexK Mar 30 '21 at 06:29
  • Beware of the size of the final dataframe. If what you have shown is representative, you could end up with 1 to 10 G rows. Unless you have a machine with a **lot** of memory, you may have to split the work into smaller pieces (see the chunking sketch after these comments). – Serge Ballesta Mar 30 '21 at 06:58
  • Saw AlexK's comment after posting my answer. It is correct that the question is similar. However, as the dataset is big, we need a solution with better execution time. That solution calls `pd.date_range()` once and `pd.to_datetime()` twice; my solution calls `pd.date_range()` only, without separate calls to `pd.to_datetime()`. Using %timeit to compare the run times, my solution is nearly 20% faster. – SeaBean Mar 30 '21 at 07:31
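Following up on Serge Ballesta's memory warning, one possible way to bound peak memory is to expand the frame chunk by chunk and write each piece out as you go. This is only a sketch; the chunk size, the `expand_in_chunks` name and the Parquet output are assumptions rather than part of any answer below:

import pandas as pd

def expand_in_chunks(df, chunk_size=1_000_000):
    # Expand date ranges one slice at a time so that only one expanded
    # chunk is held in memory at any moment.
    for start in range(0, len(df), chunk_size):
        chunk = df.iloc[start:start + chunk_size].copy()
        chunk['date'] = list(map(pd.date_range,
                                 chunk['start_date_valid'],
                                 chunk['end_date_valid']))
        yield (chunk.explode('date')
                    .drop(['start_date_valid', 'end_date_valid'], axis=1))

# Example: stream each expanded chunk to disk instead of holding
# billions of rows in RAM at once.
# for i, part in enumerate(expand_in_chunks(df)):
#     part.to_parquet(f'expanded_{i}.parquet')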

2 Answers


As your dataset is big, you have to use more efficient operations that make use of pandas vectorized operations. You can use `list(map())` together with `pd.date_range()` and `df.explode()` as follows:

import pandas as pd

# Build one DatetimeIndex per row, then explode it into one row per day.
df['date'] = list(map(lambda x, y: pd.date_range(start=x, end=y),
                      df['start_date_valid'],
                      df['end_date_valid']))
df = (df.explode('date')
        .drop(['start_date_valid', 'end_date_valid'], axis=1))




print(df)

    shop   item  price       date
0  shop1  item1  100.0 2019-01-01
0  shop1  item1  100.0 2019-01-02
0  shop1  item1  100.0 2019-01-03
0  shop1  item1  100.0 2019-01-04
0  shop1  item1  100.0 2019-01-05
0  shop1  item1  100.0 2019-01-06
1  shop1  item1  120.0 2019-01-07
1  shop1  item1  120.0 2019-01-08
1  shop1  item1  120.0 2019-01-09
1  shop1  item1  120.0 2019-01-10
2  shop1  item2   80.0 2018-12-31
2  shop1  item2   80.0 2019-01-01
2  shop1  item2   80.0 2019-01-02
2  shop1  item2   80.0 2019-01-03
3  shop1  item2  150.0 2019-01-04
3  shop1  item2  150.0 2019-01-05
3  shop1  item2  150.0 2019-01-06
4  shop2  item1  105.0 2019-01-01
4  shop2  item1  105.0 2019-01-02
4  shop2  item1  105.0 2019-01-03
4  shop2  item1  105.0 2019-01-04
4  shop2  item1  105.0 2019-01-05
4  shop2  item1  105.0 2019-01-06
5  shop2  item1  130.0 2019-01-07
5  shop2  item1  130.0 2019-01-08
5  shop2  item1  130.0 2019-01-09
5  shop2  item1  130.0 2019-01-10
6  shop2  item2   85.0 2018-12-31
6  shop2  item2   85.0 2019-01-01
6  shop2  item2   85.0 2019-01-02
6  shop2  item2   85.0 2019-01-03
7  shop2  item2  145.0 2019-01-04
7  shop2  item2  145.0 2019-01-05
7  shop2  item2  145.0 2019-01-06

`list(map())` is considerably faster than other means such as `df.apply()` on axis=1; it can be 3x to 4x faster, which makes it suitable for your use case. You can refer to this post for further information on the execution-time comparison of various ways to apply a function.
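If you want to reproduce the timing comparison on your own machine, here is a minimal sketch (the synthetic frame size and the `timeit` setup are assumptions; actual numbers will vary):

import timeit
import pandas as pd

# Synthetic frame; scale n up or down as needed.
n = 10_000
df_test = pd.DataFrame({
    'shop': ['shop1'] * n,
    'item': ['item1'] * n,
    'price': 100.0,
    'start_date_valid': pd.Timestamp('2019-01-01'),
    'end_date_valid': pd.Timestamp('2019-01-10'),
})

def with_map(d):
    d = d.copy()
    d['date'] = list(map(pd.date_range, d['start_date_valid'], d['end_date_valid']))
    return d.explode('date').drop(['start_date_valid', 'end_date_valid'], axis=1)

def with_apply(d):
    d = d.copy()
    d['date'] = d.apply(lambda r: pd.date_range(r['start_date_valid'],
                                                r['end_date_valid']), axis=1)
    return d.explode('date').drop(['start_date_valid', 'end_date_valid'], axis=1)

print('list(map):', timeit.timeit(lambda: with_map(df_test), number=3))
print('df.apply :', timeit.timeit(lambda: with_apply(df_test), number=3))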

SeaBean
  • As an aside, you could unpack instead of calling the `list` method for a bit more speed – sammywemmy Mar 30 '21 at 07:55
  • @sammywemmy You mean unpacking, in a manner like `*map(...`? I referred to [here](https://docs.python.org/3/library/2to3.html?highlight=map#2to3fixer-map) for wrapping `map()` in a `list` call. `*map(...` on its own doesn't seem to work. It would be great if there is a way to improve it. – SeaBean Mar 30 '21 at 08:13
  • @sammywemmy I test-ran both `[*map(...)]` and `list(map(...))`. They perform just the same: always around 1.61 ms vs 1.6 ms, with the first sometimes faster and sometimes slower. After several runs they remain equal in the performance tests. I think these two can be regarded as the same thing with different syntax. – SeaBean Mar 30 '21 at 08:23
  • @sammywemmy I guess you may be referring to the case where passing parameters to a function by unpacking is faster than passing them individually one by one, like [this answer](https://stackoverflow.com/a/39259437/15070697) where using `f(*x)` can be twice as fast as `f(x.col_1, x.col_2)` (refer to the comment by Sylvain on that answer). Back to our case, it does not seem related to that use of unpacking for passing parameters; here `[...]` is just the same as `list()` with different syntax (a small reproduction sketch follows these comments). – SeaBean Mar 30 '21 at 09:21
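A small reproduction of the `[*map(...)]` vs `list(map(...))` comparison discussed in these comments (the sizes and the resulting timings are illustrative only):

import timeit
import pandas as pd

starts = pd.Series([pd.Timestamp('2019-01-01')] * 5_000)
ends = pd.Series([pd.Timestamp('2019-01-10')] * 5_000)

# Both spellings build the same list of DatetimeIndex objects;
# only the syntax differs.
t_list = timeit.timeit(lambda: list(map(pd.date_range, starts, ends)), number=5)
t_star = timeit.timeit(lambda: [*map(pd.date_range, starts, ends)], number=5)
print(f'list(map(...)): {t_list:.2f}s   [*map(...)]: {t_star:.2f}s')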

Good one, same as the solution below.

df["daterange"] = df.apply(lambda x: pd.date_range(x.start_date, x.end_date), axis=1)
df = df.explode('daterange').reset_index(drop=True)
print (df)
ayush
  • If it's the same solution as below, there is no need to give the same answer again unless you add more information. – NoNameAv May 26 '22 at 18:20