Suppose I have a table with three columns: date, id, and value.
df = spark.createDataFrame(
    [('2023-01-01', 1001, 5),
     ('2023-01-15', 1001, 3),
     ('2023-02-10', 1001, 1),
     ('2023-02-20', 1001, 2),
     ('2023-01-02', 1002, 7),
     ('2023-01-02', 1002, 6),
     ('2023-01-03', 1002, 1)],
    ["date", "id", "value"])
df.show()
# +----------+----+-----+
# | date| id|value|
# +----------+----+-----+
# |2023-01-01|1001| 5|
# |2023-01-15|1001| 3|
# |2023-02-10|1001| 1|
# |2023-02-20|1001| 2|
# |2023-01-02|1002| 7|
# |2023-01-02|1002| 6|
# |2023-01-03|1002| 1|
# +----------+----+-----+
I would like to compute, for every date, the 30-day rolling sum of value per id, and additionally the number of distinct dates on which that id was seen in the past 30 days. Something that would look like this:
+----------+----+-----+----------------+-------------------------+
| date| id|value|30_day_value_sum|days_seen_in_past_30_days|
+----------+----+-----+----------------+-------------------------+
|2023-01-01|1001| 5| 0| 0|
|2023-01-15|1001| 3| 0| 1|
|2023-02-10|1001| 1| 3| 1|
|2023-02-20|1001| 2| 1| 2|
|2023-01-02|1002| 7| 0| 0|
|2023-01-02|1002| 6| 7| 1|
|2023-01-03|1002| 1| 13| 2|
+----------+----+-----+----------------+-------------------------+
I suspect this could be done with a Window, but I am not clear on the details.
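Here is a rough sketch of what I imagine it might look like, assuming a range-based frame keyed on the date cast to Unix seconds and covering the preceding 30 days while excluding the current row; since count(distinct) is not supported over a window, it uses collect_set plus size as a stand-in. I am not sure the frame bounds are right, particularly for duplicate dates, so this would likely need adjusting to reproduce the table above exactly:

from pyspark.sql import functions as F, Window

# Frame over the preceding 30 days (expressed in seconds), excluding the current row.
# Both the cast-to-Unix-seconds trick and the exact bounds are assumptions on my part.
w = (Window.partitionBy("id")
     .orderBy(F.col("date").cast("timestamp").cast("long"))
     .rangeBetween(-30 * 86400, -1))

result = (df
          .withColumn("30_day_value_sum",
                      # sum over an empty frame is null, so fall back to 0
                      F.coalesce(F.sum("value").over(w), F.lit(0)))
          .withColumn("days_seen_in_past_30_days",
                      # distinct count is not allowed over a window, so collect_set + size
                      F.size(F.collect_set("date").over(w))))
result.show()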