I have been following some of the code here: pyspark: rolling average using timeseries data
and found that after running some rolling-average window operations over a DataFrame of about 25 billion rows, it takes the cluster hours to run display(df)
after the operation to inspect the results.
What could be causing this and how might I get around it?
I am doing operations like this:
from pyspark.sql import Window
from pyspark.sql import functions as f

w = (Window
     .partitionBy(f.col("id"))
     .orderBy(f.col("time").cast("long"))
     .rangeBetween(-3600 * 2, Window.currentRow))  # trailing 2-hour window (7200 seconds)
df = df.withColumn("rolling_temp", f.avg("temp").over(w))
I have tried filtering the DataFrame after this operation to reduce the number of rows, and coalescing it to account for the smaller size, roughly as in the sketch below, but to no avail.
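For concreteness, this is roughly the shape of that attempt, continuing from the snippet above; the filter predicate and partition count here are placeholders, not my actual values:

df_small = (df
            .filter(f.col("rolling_temp") > 20.0)  # placeholder predicate, just to cut rows
            .coalesce(200))                        # placeholder partition count for the smaller result
display(df_small)  # Databricks display; this is still the point where the whole plan, window included, executes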