
I have a dataframe `df` with one or more entries per id, possibly with overlapping date ranges for the same id:

start_date | end_date   | id | col1 | col2 | ...
2020-10-01 | 2020-10-01 | a  | 2    | 3    | ...
2020-10-01 | 2020-10-03 | a  | 4    | 5    | ...
2020-10-02 | 2020-10-07 | b  | 7    | 8    | ...
2020-10-06 | 2020-10-08 | b  | 9    | 10   | ...
2020-10-06 | 2020-10-08 | c  | k    | p    | ...
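
For reproducibility, the sample frame can be reconstructed like this (a minimal sketch; the real frame has more columns, abbreviated as ... above):

import pandas as pd

df = pd.DataFrame({
    'start_date': pd.to_datetime(['2020-10-01', '2020-10-01', '2020-10-02',
                                  '2020-10-06', '2020-10-06']),
    'end_date':   pd.to_datetime(['2020-10-01', '2020-10-03', '2020-10-07',
                                  '2020-10-08', '2020-10-08']),
    'id':   ['a', 'a', 'b', 'b', 'c'],
    'col1': ['2', '4', '7', '9', 'k'],
    'col2': ['3', '5', '8', '10', 'p'],
})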

I need to expand each row into one row per day, covering every day from start_date to end_date inclusive.

End result should look like:

date       | id | col1 | col2 | ...
2020-10-01 | a  | 2    | 3    | ...
2020-10-01 | a  | 4    | 5    | ...
2020-10-02 | a  | 4    | 5    | ...
2020-10-03 | a  | 4    | 5    | ...
2020-10-02 | b  | 7    | 8    | ...
2020-10-03 | b  | 7    | 8    | ...
2020-10-04 | b  | 7    | 8    | ...
2020-10-05 | b  | 7    | 8    | ...
2020-10-06 | b  | 7    | 8    | ...
2020-10-07 | b  | 7    | 8    | ...
2020-10-06 | b  | 9    | 10   | ...
2020-10-07 | b  | 9    | 10   | ...
2020-10-08 | b  | 9    | 10   | ...
2020-10-06 | c  | k    | p    | ...
2020-10-07 | c  | k    | p    | ...
2020-10-08 | c  | k    | p    | ...

I have ~10 million rows and this transformation needs to run multiple times a day, so it must be fast. All solutions I've found use apply/lambda and take ~20-30 minutes on my PC, which is far too slow.

I tried melt + groupby (https://stackoverflow.com/a/57334167/4435175), but the resulting dataframe is missing the days between start_date and end_date.

Any fast ways to do this?

Vega
  • Surely `melt` can do this? – sammywemmy Feb 16 '21 at 10:24
  • `I have ~10 million rows and this transformation needs to be done multiple times a day,` - not sure a fast pandas solution exists for this – jezrael Feb 16 '21 at 10:24
  • @sammywemmy - yes, `melt` + `groupby` with reindex; a sketch follows these comments – jezrael Feb 16 '21 at 10:25
  • @sammywemmy - I like `itertuples` like [here](https://stackoverflow.com/questions/62402562/python-split-start-and-end-date-into-all-days-between-start-and-end-date/62404862#62404862), but I think it is still slow :( – jezrael Feb 16 '21 at 10:26
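
For reference, the `melt` + `groupby` + reindex idea from the comments might look like the sketch below (an assumption, not the commenters' exact code; it fills the missing days by upsampling each original row to daily frequency):

# melt start/end into one long 'date' column, keyed by the original row number
long_df = (df.reset_index()
             .melt(id_vars=['index', 'id', 'col1', 'col2'],
                   value_vars=['start_date', 'end_date'],
                   value_name='date')
             .drop(columns='variable')
             # rows with start_date == end_date melt into two identical rows,
             # which would break the daily upsampling below
             .drop_duplicates())

# per original row, upsample to daily frequency and forward-fill the values
out = (long_df.set_index('date')
              .groupby('index')[['id', 'col1', 'col2']]
              .resample('D')
              .ffill()
              .reset_index())

This produces the missing in-between days, but it is not necessarily faster than the answer below.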

1 Answer

One idea is to use DataFrame.itertuples, but for very large data it will still be slow:

# for each row, map every day in its date range back to the original row index
s = pd.concat([pd.Series(r.Index, pd.date_range(r.start_date, r.end_date))
               for r in df.itertuples()])
# s has the dates as its index and the original row index as its values
# print (s)

# drop the range columns and join the expanded dates back on the row index
df = (df.drop(['start_date', 'end_date'], axis=1)
        .join(pd.DataFrame({'Date': s.index}, index=s)))
print(df)
  id col1 col2       Date
0  a    2    3 2020-10-01
1  a    4    5 2020-10-01
1  a    4    5 2020-10-02
1  a    4    5 2020-10-03
2  b    7    8 2020-10-02
2  b    7    8 2020-10-03
2  b    7    8 2020-10-04
2  b    7    8 2020-10-05
2  b    7    8 2020-10-06
2  b    7    8 2020-10-07
3  b    9   10 2020-10-06
3  b    9   10 2020-10-07
3  b    9   10 2020-10-08
4  c    k    p 2020-10-06
4  c    k    p 2020-10-07
4  c    k    p 2020-10-08
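
If this is still too slow, a fully vectorized sketch (starting again from the original frame, and assuming the date columns are datetime64 and the index is unique) is to repeat each row by its number of days and add a running day offset, avoiding the per-row Python loop entirely:

# how many days each row covers, inclusive of both endpoints
n_days = (df['end_date'] - df['start_date']).dt.days + 1

# repeat each row once per day it covers
out = df.loc[df.index.repeat(n_days)].copy()

# running offset 0, 1, 2, ... within each repeated block
# (grouping on the index level assumes the original index was unique)
offset = out.groupby(level=0).cumcount()

out['Date'] = out['start_date'] + pd.to_timedelta(offset, unit='D')
out = out.drop(columns=['start_date', 'end_date'])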
jezrael
  • Thanks, this takes ~300 seconds on my dataset, which is way faster than my old solution but sadly still too slow. – Vega Feb 16 '21 at 13:15
  • @Vega - I know, unfortunately. – jezrael Feb 16 '21 at 13:16
  • @Vega - Would a SQL solution be possible, e.g. move the values to MySQL and join there? – jezrael Feb 16 '21 at 13:16
  • I could put it into a sqlite db, query there and get it back as a pd.DataFrame, but I don't know how to create the days from a start_date/end_date in SQL (see the sketch after these comments). – Vega Feb 16 '21 at 13:38
  • @Vega - me neither. I think if you post this question with the SQL tag (sqlite), you might get some good solutions. – jezrael Feb 16 '21 at 13:39
  • I will preprocess my data with your code and just work with the saved Parquet file; that will work for now. Thanks a lot. – Vega Feb 17 '21 at 11:04
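
Regarding the SQL idea in the comments above: SQLite can generate the days with a recursive CTE. A hedged sketch of the round trip (the table name ranges and the in-memory database are assumptions) might be:

import sqlite3
import pandas as pd

con = sqlite3.connect(':memory:')
df.to_sql('ranges', con, index=False)

# recursive CTE: start each range at start_date and keep adding one day
# until end_date is reached; dates are stored as ISO-8601 text in SQLite,
# so date() normalizes them and string comparison orders them correctly
query = """
WITH RECURSIVE expanded(date, end_date, id, col1, col2) AS (
    SELECT date(start_date), date(end_date), id, col1, col2 FROM ranges
    UNION ALL
    SELECT date(date, '+1 day'), end_date, id, col1, col2
    FROM expanded
    WHERE date < end_date
)
SELECT date, id, col1, col2 FROM expanded ORDER BY id, date;
"""
out = pd.read_sql(query, con, parse_dates=['date'])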