
I have the following Pandas dataframe:

shop   item   price   start_date_valid  end_date_valid
shop1  item1  100.00  2019-01-01        2019-01-06
shop1  item1  120.00  2019-01-07        2019-01-10
shop1  item2   80.00  2018-12-31        2019-01-03
shop1  item2  150.00  2019-01-04        2019-01-06
shop2  item1  105.00  2019-01-01        2019-01-06
shop2  item1  130.00  2019-01-07        2019-01-10
shop2  item2   85.00  2018-12-31        2019-01-03
shop2  item2  145.00  2019-01-04        2019-01-06

I need to get the following table as output:

shop   item   date        price
shop1  item1  2019-01-01  100.00
shop1  item1  2019-01-02  100.00
shop1  item1  2019-01-03  100.00
shop1  item1  2019-01-04  100.00
shop1  item1  2019-01-05  100.00
shop1  item1  2019-01-06  100.00
shop1  item1  2019-01-07  120.00
shop1  item1  2019-01-08  120.00
shop1  item1  2019-01-09  120.00
shop1  item1  2019-01-10  120.00
shop1  item2  2018-12-31   80.00
shop1  item2  2019-01-01   80.00
shop1  item2  2019-01-02   80.00
shop1  item2  2019-01-03   80.00
shop1  item2  2019-01-04  150.00
shop1  item2  2019-01-05  150.00
shop1  item2  2019-01-06  150.00

for all shop/item combinations. The problem is that the source dataframe is very big (about 100 million rows), so a simple `df.groupby(['shop', 'item']).apply(...)` looks unsuitable to me. How can I produce the desired output in the fastest way possible?
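For reference, a minimal sketch of how the sample data can be rebuilt and of the groupby/apply baseline I would like to avoid (the construction and the `expand` helper are only illustrative):

import pandas as pd

# Small reproduction of the source dataframe (illustrative only).
df = pd.DataFrame({
    'shop': ['shop1', 'shop1', 'shop1', 'shop1', 'shop2', 'shop2', 'shop2', 'shop2'],
    'item': ['item1', 'item1', 'item2', 'item2', 'item1', 'item1', 'item2', 'item2'],
    'price': [100.00, 120.00, 80.00, 150.00, 105.00, 130.00, 85.00, 145.00],
    'start_date_valid': pd.to_datetime(['2019-01-01', '2019-01-07', '2018-12-31', '2019-01-04',
                                        '2019-01-01', '2019-01-07', '2018-12-31', '2019-01-04']),
    'end_date_valid': pd.to_datetime(['2019-01-06', '2019-01-10', '2019-01-03', '2019-01-06',
                                      '2019-01-06', '2019-01-10', '2019-01-03', '2019-01-06']),
})

# Baseline: expand every row to one row per day inside groupby().apply().
# Works, but is far too slow at ~100 million input rows.
def expand(group):
    out = []
    for _, row in group.iterrows():
        for date in pd.date_range(row['start_date_valid'], row['end_date_valid']):
            out.append({'shop': row['shop'], 'item': row['item'],
                        'date': date, 'price': row['price']})
    return pd.DataFrame(out)

slow = df.groupby(['shop', 'item'], group_keys=False).apply(expand).reset_index(drop=True)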

  • Does this answer your question? [Expand pandas dataframe date ranges to individual rows](https://stackoverflow.com/questions/60039948/expand-pandas-dataframe-date-ranges-to-individual-rows) – AlexK Mar 30 '21 at 06:29
  • Beware of the size of the final dataframe. If what you have shown is representative, you could end up with 1 to 10 G rows. Unless you have a machine with a **lot** of memory, you may have to split the work into smaller pieces (see the chunking sketch after these comments). – Serge Ballesta Mar 30 '21 at 06:58
  • Saw AlexK's comment after posting my answer. It is correct that the question is similar. However, as the dataset is big, we need a solution with better execution time. That solution calls `pd.date_range()` once and `pd.to_datetime()` twice; my solution calls `pd.date_range()` only, without separate calls to `pd.to_datetime()`. Using %timeit to compare the run times, my solution is nearly 20% faster. – SeaBean Mar 30 '21 at 07:31
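Following up on Serge Ballesta's memory warning, one possible way to bound peak memory is to expand the frame chunk by chunk and write each piece out as you go. This is only a sketch; the chunk size, the `expand_in_chunks` name and the Parquet output are assumptions rather than part of any answer below:

import pandas as pd

def expand_in_chunks(df, chunk_size=1_000_000):
    # Expand date ranges one slice at a time so that only one expanded
    # chunk is held in memory at any moment.
    for start in range(0, len(df), chunk_size):
        chunk = df.iloc[start:start + chunk_size].copy()
        chunk['date'] = list(map(pd.date_range,
                                 chunk['start_date_valid'],
                                 chunk['end_date_valid']))
        yield (chunk.explode('date')
                    .drop(['start_date_valid', 'end_date_valid'], axis=1))

# Example: stream each expanded chunk to disk instead of holding
# billions of rows in RAM at once.
# for i, part in enumerate(expand_in_chunks(df)):
#     part.to_parquet(f'expanded_{i}.parquet')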

2 Answers


As your dataset is big, you have to use more efficient operations that make use of pandas vectorized operations. You can use `list(map())` together with `pd.date_range()` and `df.explode()` as follows:

import pandas as pd

# Build one DatetimeIndex per row, then explode it into one row per day.
df['date'] = list(map(lambda x, y: pd.date_range(start=x, end=y),
                      df['start_date_valid'],
                      df['end_date_valid']))
df = (df.explode('date')
        .drop(['start_date_valid', 'end_date_valid'], axis=1))




print(df)

    shop   item  price       date
0  shop1  item1  100.0 2019-01-01
0  shop1  item1  100.0 2019-01-02
0  shop1  item1  100.0 2019-01-03
0  shop1  item1  100.0 2019-01-04
0  shop1  item1  100.0 2019-01-05
0  shop1  item1  100.0 2019-01-06
1  shop1  item1  120.0 2019-01-07
1  shop1  item1  120.0 2019-01-08
1  shop1  item1  120.0 2019-01-09
1  shop1  item1  120.0 2019-01-10
2  shop1  item2   80.0 2018-12-31
2  shop1  item2   80.0 2019-01-01
2  shop1  item2   80.0 2019-01-02
2  shop1  item2   80.0 2019-01-03
3  shop1  item2  150.0 2019-01-04
3  shop1  item2  150.0 2019-01-05
3  shop1  item2  150.0 2019-01-06
4  shop2  item1  105.0 2019-01-01
4  shop2  item1  105.0 2019-01-02
4  shop2  item1  105.0 2019-01-03
4  shop2  item1  105.0 2019-01-04
4  shop2  item1  105.0 2019-01-05
4  shop2  item1  105.0 2019-01-06
5  shop2  item1  130.0 2019-01-07
5  shop2  item1  130.0 2019-01-08
5  shop2  item1  130.0 2019-01-09
5  shop2  item1  130.0 2019-01-10
6  shop2  item2   85.0 2018-12-31
6  shop2  item2   85.0 2019-01-01
6  shop2  item2   85.0 2019-01-02
6  shop2  item2   85.0 2019-01-03
7  shop2  item2  145.0 2019-01-04
7  shop2  item2  145.0 2019-01-05
7  shop2  item2  145.0 2019-01-06

`list(map())` is considerably faster than other means such as `df.apply()` on axis=1; it can be 3x to 4x faster, which makes it suitable for your use case. You can refer to this post for further information on the execution-time comparison of various ways to apply a function.
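If you want to reproduce the timing comparison on your own machine, here is a minimal sketch (the synthetic frame size and the `timeit` setup are assumptions; actual numbers will vary):

import timeit
import pandas as pd

# Synthetic frame; scale n up or down as needed.
n = 10_000
df_test = pd.DataFrame({
    'shop': ['shop1'] * n,
    'item': ['item1'] * n,
    'price': 100.0,
    'start_date_valid': pd.Timestamp('2019-01-01'),
    'end_date_valid': pd.Timestamp('2019-01-10'),
})

def with_map(d):
    d = d.copy()
    d['date'] = list(map(pd.date_range, d['start_date_valid'], d['end_date_valid']))
    return d.explode('date').drop(['start_date_valid', 'end_date_valid'], axis=1)

def with_apply(d):
    d = d.copy()
    d['date'] = d.apply(lambda r: pd.date_range(r['start_date_valid'],
                                                r['end_date_valid']), axis=1)
    return d.explode('date').drop(['start_date_valid', 'end_date_valid'], axis=1)

print('list(map):', timeit.timeit(lambda: with_map(df_test), number=3))
print('df.apply :', timeit.timeit(lambda: with_apply(df_test), number=3))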

SeaBean
  • As an aside, you could unpack instead of calling the `list` method for a bit more speed – sammywemmy Mar 30 '21 at 07:55
  • @sammywemmy You mean unpacking, in a manner like `*map(...`? I referred to [here](https://docs.python.org/3/library/2to3.html?highlight=map#2to3fixer-map) for wrapping `map()` in a `list` call. `*map(...` on its own doesn't seem to work. It would be great if there is a way to improve it. – SeaBean Mar 30 '21 at 08:13
  • @sammywemmy I test-ran both `[*map(...)]` and `list(map(...))`. They perform just the same: always around 1.61 ms vs 1.6 ms, with the first sometimes faster and sometimes slower. After several runs they remain equal in the performance tests. I think these two can be regarded as the same thing with different syntax. – SeaBean Mar 30 '21 at 08:23
  • @sammywemmy I guess you may be referring to the case where passing parameters to a function by unpacking is faster than passing them individually one by one, like [this answer](https://stackoverflow.com/a/39259437/15070697) where using `f(*x)` can be twice as fast as `f(x.col_1, x.col_2)` (refer to the comment by Sylvain on that answer). Back to our case, it does not seem related to that use of unpacking for passing parameters; here `[...]` is just the same as `list()` with different syntax (a small reproduction sketch follows these comments). – SeaBean Mar 30 '21 at 09:21
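A small reproduction of the `[*map(...)]` vs `list(map(...))` comparison discussed in these comments (the sizes and the resulting timings are illustrative only):

import timeit
import pandas as pd

starts = pd.Series([pd.Timestamp('2019-01-01')] * 5_000)
ends = pd.Series([pd.Timestamp('2019-01-10')] * 5_000)

# Both spellings build the same list of DatetimeIndex objects;
# only the syntax differs.
t_list = timeit.timeit(lambda: list(map(pd.date_range, starts, ends)), number=5)
t_star = timeit.timeit(lambda: [*map(pd.date_range, starts, ends)], number=5)
print(f'list(map(...)): {t_list:.2f}s   [*map(...)]: {t_star:.2f}s')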

Good one, same as the solution below.

df["daterange"] = df.apply(lambda x: pd.date_range(x.start_date, x.end_date), axis=1)
df = df.explode('daterange').reset_index(drop=True)
print (df)
ayush
  • If it's the same solution as below, there is no need to give the same answer again unless you add more information. – NoNameAv May 26 '22 at 18:20