Create historical data from a dataframe in pyspark

Question

I have a dataframe as follows:

date	some_quantity
...	...
2021-01-01	4
2021-01-02	1
2021-01-03	6
2021-01-04	2
2021-01-05	2
2021-01-06	8
2021-01-07	9
2021-01-08	1
...	...

I would like to create the historical data for each calendar day, and in a final step do some aggregations. The intermediate dataframe should look like this:

calendar_date	date	some_quantity
...	...	...
2021-01-03	2021-01-01	4
2021-01-03	2021-01-02	1
2021-01-04	...	...
2021-01-04	2021-01-01	4
2021-01-04	2021-01-02	1
2021-01-04	2021-01-03	6
2021-01-05	...	...
2021-01-05	2021-01-01	4
2021-01-05	2021-01-02	1
2021-01-05	2021-01-03	6
2021-01-05	2021-01-04	2
2021-01-06	...	...
2021-01-06	2021-01-01	4
2021-01-06	2021-01-02	1
2021-01-06	2021-01-03	6
2021-01-06	2021-01-04	2
2021-01-06	2021-01-05	2
2021-01-06	...	...

With this dataframe any aggregation on the calendar date is easy (e.g indicate how many quantities were sold prior to that day, average 7days, average30days, etc.).

I tried to run a for loop of calendar dates:

for i, date in enumerate(pd.data_range("2021-01-01","2021-03-01"):

   df_output = []

   df_transformed = df.where(F.col("date") < date)
   df_transformed = df_transformed.withColumn("calendar_date", date)

   if i == 0:
      df_output = df_transformed
   else:
      df_output = df_output.union(df_transformed)

However, this is highly inefficient and it crashes as soon as I started including more columns.

Is it possible to create a dataframe with calendar dates and do a join that recreated the dataframe I expect?

Thanks for any help.

This is an [XY problem](https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem). You don't need to create this dataframe to do time series calculations. Simply use window functions with a time range window. — mck, May 10 '21 at 10:23
Sorry, I don't follow. Could you be more specific about the window functions with a time range? An example would be great. Also, the real dataset is quite big (both in rows and columns), and I simplified as to keep it in a manageable form. — Milo Ventimiglia, May 10 '21 at 11:23
See https://stackoverflow.com/questions/33207164/spark-window-functions-rangebetween-dates — mck, May 10 '21 at 11:28
I think there's a better answer as this only works if the dates are all there. If dates were not 2021-01-01, 2021-01-02,... 2021-01-03, 2021-01-04... but instead 2021-01-01, 2021-01-03, 2021-01-05 how would you do it? Cuz I want all the calendar dates in a year. — Milo Ventimiglia, May 10 '21 at 13:25
Btw, I think it is relevant to mention that this is for backtesting, ie. testing different parameters in historical data. Hence, the need for the calendar dates. Here it is simplified but each parameter combination would yield the dataframe that you see above, that ultimatey needs to be unioned for comparison purposes. — Milo Ventimiglia, May 10 '21 at 13:53

score 0 · Accepted Answer · answered May 13 '21 at 04:47

You can simply join a calendar dataframe with your main dataframe with join condition "less than":

# Let's call your main dataframe as `df`

# Extracting first and last date
_, min_date, max_date = (df
    .groupBy(F.lit(1))
    .agg(
        F.min('date').alias('min_date'),
        F.max('date').alias('max_date'),
    )
    .first()
)

# Then create a temporary dataframe to hold all calendar dates
dates = [{'calendar_date': str(d.date())} for d in pd.date_range(min_date, max_date)]
calendar_df = spark.createDataFrame(dates)
calendar_df.show(10, False)
# +-------------+
# |calendar_date|
# +-------------+
# |2021-01-01   |
# |2021-01-02   |
# |2021-01-03   |
# |2021-01-04   |
# |2021-01-05   |
# |2021-01-06   |
# |2021-01-07   |
# |2021-01-08   |
# +-------------+

# Now you can join to build your expected dataframe, note the join condition
(calendar_df
    .join(df, on=[calendar_df.calendar_date > df.date])
    .show()
)
# +-------------+----------+---+
# |calendar_date|      date|qty|
# +-------------+----------+---+
# |   2021-01-02|2021-01-01|  4|
# |   2021-01-03|2021-01-01|  4|
# |   2021-01-03|2021-01-02|  1|
# |   2021-01-04|2021-01-01|  4|
# |   2021-01-04|2021-01-02|  1|
# |   2021-01-04|2021-01-03|  6|
# |   2021-01-05|2021-01-01|  4|
# |   2021-01-05|2021-01-02|  1|
# |   2021-01-05|2021-01-03|  6|
# |   2021-01-05|2021-01-04|  2|
# |   2021-01-06|2021-01-01|  4|
# |   2021-01-06|2021-01-02|  1|
# |   2021-01-06|2021-01-03|  6|
# |   2021-01-06|2021-01-04|  2|
# |   2021-01-06|2021-01-05|  2|
# |   2021-01-07|2021-01-01|  4|
# |   2021-01-07|2021-01-02|  1|
# |   2021-01-07|2021-01-03|  6|
# |   2021-01-07|2021-01-04|  2|
# |   2021-01-07|2021-01-05|  2|
# +-------------+----------+---+
# only showing top 20 rows

Create historical data from a dataframe in pyspark

1 Answers1