
I have been following some of the code here: pyspark: rolling average using timeseries data

and found that after running a rolling-average window operation over a DataFrame of about 25 billion rows, it takes the cluster hours to run display(df) when I try to inspect the results afterwards.

What could be causing this and how might I get around it?

I am doing operations like this:

from pyspark.sql import Window
from pyspark.sql import functions as f

w = (Window
     .partitionBy(f.col("id"))
     .orderBy(f.col("time").cast("long"))
     .rangeBetween(-3600 * 2, Window.currentRow))  # 2-hour lookback window

df = df.withColumn("rolling_temp", f.avg("temp").over(w))

I have tried filtering the DataFrame after this operation to reduce the number of rows, and also calling coalesce to account for the smaller result size, but to no avail.
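Roughly what I mean by filtering afterwards and coalescing (the filter condition and partition count below are placeholders, not my real values):

df_filtered = (df
               .filter(f.col("rolling_temp").isNotNull())  # placeholder filter; my real one yields ~750 million rows
               .coalesce(200))                             # fewer partitions for the smaller result; 200 is a guess

display(df_filtered)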

John Stud
  • Several hours for 25 billion rows sounds reasonable to me. It simply takes time for the calculations to run, and the time depends on the hardware you're using. – mck Nov 18 '20 at 18:28
  • You might try filtering **before** doing the operation – mck Nov 18 '20 at 18:29
  • Understandable -- I'll keep playing with it. I need to filter **after** the operation, which yields roughly 750 million rows. Running `display(df)` on 750 million rows is taking a very long time, too. – John Stud Nov 18 '20 at 19:13
  • Why would you want to display 750 million rows? Surely you are not going to parse it manually?? – mck Nov 18 '20 at 19:30
  • Well, I just want to glance at the first 1000 rows, so I have been using that (see the sketch below). – John Stud Nov 18 '20 at 19:34
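
A minimal sketch of what the comments suggest (filter before the window runs, then pull only a limited number of rows to look at); the filter predicate here is a placeholder and not from the original thread:

from pyspark.sql import Window
from pyspark.sql import functions as f

# Filter first so the window only has to run over the rows that matter.
# Note: rows near the filter boundary will see a truncated 2-hour window.
df_small = df.filter(f.col("time") >= "2020-11-01")  # placeholder predicate

w = (Window
     .partitionBy(f.col("id"))
     .orderBy(f.col("time").cast("long"))
     .rangeBetween(-3600 * 2, Window.currentRow))

df_small = df_small.withColumn("rolling_temp", f.avg("temp").over(w))

# Glance at the first 1000 rows without materialising the full result.
display(df_small.limit(1000))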

0 Answers