I have a dataframe as follows:
date | some_quantity |
---|---|
... | ... |
2021-01-01 | 4 |
2021-01-02 | 1 |
2021-01-03 | 6 |
2021-01-04 | 2 |
2021-01-05 | 2 |
2021-01-06 | 8 |
2021-01-07 | 9 |
2021-01-08 | 1 |
... | ... |
For each calendar date, I would like to collect all rows dated before it (its history), and then run some aggregations as a final step. The intermediate dataframe should look like this:
calendar_date | date | some_quantity |
---|---|---|
... | ... | ... |
2021-01-03 | 2021-01-01 | 4 |
2021-01-03 | 2021-01-02 | 1 |
2021-01-04 | ... | ... |
2021-01-04 | 2021-01-01 | 4 |
2021-01-04 | 2021-01-02 | 1 |
2021-01-04 | 2021-01-03 | 6 |
2021-01-05 | ... | ... |
2021-01-05 | 2021-01-01 | 4 |
2021-01-05 | 2021-01-02 | 1 |
2021-01-05 | 2021-01-03 | 6 |
2021-01-05 | 2021-01-04 | 2 |
2021-01-06 | ... | ... |
2021-01-06 | 2021-01-01 | 4 |
2021-01-06 | 2021-01-02 | 1 |
2021-01-06 | 2021-01-03 | 6 |
2021-01-06 | 2021-01-04 | 2 |
2021-01-06 | 2021-01-05 | 2 |
... | ... | ... |
With this dataframe, any aggregation per calendar date becomes easy (e.g., the total quantity sold prior to that day, a 7-day average, a 30-day average, etc.).
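I tried looping over the calendar dates:
import pandas as pd
from pyspark.sql import functions as F

df_output = None
for calendar_date in pd.date_range("2021-01-01", "2021-03-01"):
    day = F.lit(calendar_date.date())
    # Keep only the rows dated strictly before this calendar date
    df_transformed = df.where(F.col("date") < day)
    df_transformed = df_transformed.withColumn("calendar_date", day)
    # Append this slice to the accumulated result
    df_output = df_transformed if df_output is None else df_output.union(df_transformed)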
However, this is highly inefficient, and it crashes as soon as I start including more columns.
Is it possible to create a dataframe of calendar dates and do a join that produces the dataframe I expect?
Thanks for any help.