
I have a dataframe `df` with one or more entries per id, possibly with overlapping date ranges for the same id:

start_date | end_date   | id | col1 | col2 | ...
2020-10-01 | 2020-10-01 | a  | 2    | 3    | ...
2020-10-01 | 2020-10-03 | a  | 4    | 5    | ...
2020-10-02 | 2020-10-07 | b  | 7    | 8    | ...
2020-10-06 | 2020-10-08 | b  | 9    | 10   | ...
2020-10-06 | 2020-10-08 | c  | k    | p    | ...
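
For reproducibility, the sample frame can be reconstructed like this (a minimal sketch; the real frame has more columns, abbreviated as ... above):

import pandas as pd

df = pd.DataFrame({
    'start_date': pd.to_datetime(['2020-10-01', '2020-10-01', '2020-10-02',
                                  '2020-10-06', '2020-10-06']),
    'end_date':   pd.to_datetime(['2020-10-01', '2020-10-03', '2020-10-07',
                                  '2020-10-08', '2020-10-08']),
    'id':   ['a', 'a', 'b', 'b', 'c'],
    'col1': ['2', '4', '7', '9', 'k'],
    'col2': ['3', '5', '8', '10', 'p'],
})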

I need to expand each row into one row per day, covering every day from start_date to end_date inclusive.

End result should look like:

date       | id | col1 | col2 | ...
2020-10-01 | a  | 2    | 3    | ...
2020-10-01 | a  | 4    | 5    | ...
2020-10-02 | a  | 4    | 5    | ...
2020-10-03 | a  | 4    | 5    | ...
2020-10-02 | b  | 7    | 8    | ...
2020-10-03 | b  | 7    | 8    | ...
2020-10-04 | b  | 7    | 8    | ...
2020-10-05 | b  | 7    | 8    | ...
2020-10-06 | b  | 7    | 8    | ...
2020-10-07 | b  | 7    | 8    | ...
2020-10-06 | b  | 9    | 10   | ...
2020-10-07 | b  | 9    | 10   | ...
2020-10-08 | b  | 9    | 10   | ...
2020-10-06 | c  | k    | p    | ...
2020-10-07 | c  | k    | p    | ...
2020-10-08 | c  | k    | p    | ...

I have ~10 million rows and this transformation needs to run multiple times a day, so it must be fast. All solutions I've found use apply/lambda and take ~20-30 minutes on my PC, which is far too slow.

I tried melt + groupby (https://stackoverflow.com/a/57334167/4435175), but the resulting dataframe is missing the days between start_date and end_date.

Any fast ways to do this?

Vega
  • Surely `melt` can do this? – sammywemmy Feb 16 '21 at 10:24
  • `I have ~10 million rows and this transformation needs to be done multiple times a day,` - not sure a fast pandas solution exists for this – jezrael Feb 16 '21 at 10:24
  • @sammywemmy - yes, `melt` + `groupby` with reindex; a sketch follows these comments – jezrael Feb 16 '21 at 10:25
  • @sammywemmy - I like `itertuples` like [here](https://stackoverflow.com/questions/62402562/python-split-start-and-end-date-into-all-days-between-start-and-end-date/62404862#62404862), but I think it is still slow :( – jezrael Feb 16 '21 at 10:26
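
For reference, the `melt` + `groupby` + reindex idea from the comments might look like the sketch below (an assumption, not the commenters' exact code; it fills the missing days by upsampling each original row to daily frequency):

# melt start/end into one long 'date' column, keyed by the original row number
long_df = (df.reset_index()
             .melt(id_vars=['index', 'id', 'col1', 'col2'],
                   value_vars=['start_date', 'end_date'],
                   value_name='date')
             .drop(columns='variable')
             # rows with start_date == end_date melt into two identical rows,
             # which would break the daily upsampling below
             .drop_duplicates())

# per original row, upsample to daily frequency and forward-fill the values
out = (long_df.set_index('date')
              .groupby('index')[['id', 'col1', 'col2']]
              .resample('D')
              .ffill()
              .reset_index())

This produces the missing in-between days, but it is not necessarily faster than the answer below.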

1 Answer

One idea is to use DataFrame.itertuples, but for very large data it will still be slow:

# for each row, map every day in its date range back to the original row index
s = pd.concat([pd.Series(r.Index, pd.date_range(r.start_date, r.end_date))
               for r in df.itertuples()])
# s has the dates as its index and the original row index as its values
# print (s)

# drop the range columns and join the expanded dates back on the row index
df = (df.drop(['start_date', 'end_date'], axis=1)
        .join(pd.DataFrame({'Date': s.index}, index=s)))
print(df)
  id col1 col2       Date
0  a    2    3 2020-10-01
1  a    4    5 2020-10-01
1  a    4    5 2020-10-02
1  a    4    5 2020-10-03
2  b    7    8 2020-10-02
2  b    7    8 2020-10-03
2  b    7    8 2020-10-04
2  b    7    8 2020-10-05
2  b    7    8 2020-10-06
2  b    7    8 2020-10-07
3  b    9   10 2020-10-06
3  b    9   10 2020-10-07
3  b    9   10 2020-10-08
4  c    k    p 2020-10-06
4  c    k    p 2020-10-07
4  c    k    p 2020-10-08
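
If this is still too slow, a fully vectorized sketch (starting again from the original frame, and assuming the date columns are datetime64 and the index is unique) is to repeat each row by its number of days and add a running day offset, avoiding the per-row Python loop entirely:

# how many days each row covers, inclusive of both endpoints
n_days = (df['end_date'] - df['start_date']).dt.days + 1

# repeat each row once per day it covers
out = df.loc[df.index.repeat(n_days)].copy()

# running offset 0, 1, 2, ... within each repeated block
# (grouping on the index level assumes the original index was unique)
offset = out.groupby(level=0).cumcount()

out['Date'] = out['start_date'] + pd.to_timedelta(offset, unit='D')
out = out.drop(columns=['start_date', 'end_date'])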
jezrael
  • Thanks, this takes ~300 seconds on my dataset, which is way faster than my old solution but sadly still too slow. – Vega Feb 16 '21 at 13:15
  • @Vega - I know, unfortunately. – jezrael Feb 16 '21 at 13:16
  • @Vega - Would a SQL solution be possible, e.g. move the values to MySQL and join there? – jezrael Feb 16 '21 at 13:16
  • I could put it into a sqlite db, query there and get it back as a pd.DataFrame, but I don't know how to create the days from a start_date/end_date in SQL (see the sketch after these comments). – Vega Feb 16 '21 at 13:38
  • @Vega - me neither. I think if you post this question with the SQL tag (sqlite), you might get some good solutions. – jezrael Feb 16 '21 at 13:39
  • I will preprocess my data with your code and just work with the saved Parquet file; that will work for now. Thanks a lot. – Vega Feb 17 '21 at 11:04
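
Regarding the SQL idea in the comments above: SQLite can generate the days with a recursive CTE. A hedged sketch of the round trip (the table name ranges and the in-memory database are assumptions) might be:

import sqlite3
import pandas as pd

con = sqlite3.connect(':memory:')
df.to_sql('ranges', con, index=False)

# recursive CTE: start each range at start_date and keep adding one day
# until end_date is reached; dates are stored as ISO-8601 text in SQLite,
# so date() normalizes them and string comparison orders them correctly
query = """
WITH RECURSIVE expanded(date, end_date, id, col1, col2) AS (
    SELECT date(start_date), date(end_date), id, col1, col2 FROM ranges
    UNION ALL
    SELECT date(date, '+1 day'), end_date, id, col1, col2
    FROM expanded
    WHERE date < end_date
)
SELECT date, id, col1, col2 FROM expanded ORDER BY id, date;
"""
out = pd.read_sql(query, con, parse_dates=['date'])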