I have a dataframe df with one entry per id or overlapping entries per id:
start_date | end_date | id | col1 | col2 | ...
2020-10-01 | 2020-10-01 | a | 2 | 3 | ...
2020-10-01 | 2020-10-03 | a | 4 | 5 | ...
2020-10-02 | 2020-10-07 | b | 7 | 8 | ...
2020-10-06 | 2020-10-08 | b | 9 | 10 | ...
2020-10-06 | 2020-10-08 | c | k | p | ...
I need to get days created from the start_date to end_date for each row.
End result should look like:
date | id | col1 | col2 | ...
2020-10-01 | a | 2 | 3 | ...
2020-10-01 | a | 4 | 5 | ...
2020-10-02 | a | 4 | 5 | ...
2020-10-03 | a | 4 | 5 | ...
2020-10-02 | b | 7 | 8 | ...
2020-10-03 | b | 7 | 8 | ...
2020-10-04 | b | 7 | 8 | ...
2020-10-05 | b | 7 | 8 | ...
2020-10-06 | b | 7 | 8 | ...
2020-10-07 | b | 7 | 8 | ...
2020-10-06 | b | 9 | 10 | ...
2020-10-07 | b | 9 | 10 | ...
2020-10-08 | b | 9 | 10 | ...
2020-10-06 | c | k | p | ...
2020-10-07 | c | k | p | ...
2020-10-08 | c | k | p | ...
I have ~10 million rows and this transformation needs to be done multiple times a day, so the transformation must be fast. All solutions I've found use apply/lambdas and take ~20-30 minutes on my PC, which is way too slow.
I tried melt+groupby: https://stackoverflow.com/a/57334167/4435175 But the resulting dataframe is missing the days between start_date and end_date.
Any fast ways to do this?