
My Environment:

  • Databricks 10.4
  • PySpark

I'm investigating Spark performance, specifically the memory/disk spills reported in the Stages section of the Spark UI.

What I want to achieve is to get notified if my job had spills.

I have found the following, but I'm not sure how it works: https://spark.apache.org/docs/3.1.3/api/java/org/apache/spark/SpillListener.html

I want a smart way to find where the major spills are, rather than going through all the jobs/stages manually.

Ideally, I want to detect spills programmatically using PySpark.

BI Dude

1 Answer


You can use the SpillListener class as shown below:

spillListener = spark._jvm.org.apache.spark.SpillListener()
# Register the listener so it receives task/stage events (otherwise the count stays at 0)
spark.sparkContext._jsc.sc().addSparkListener(spillListener)
print(spillListener.numSpilledStages())

If you need more details, you have to extend that class and override its methods.

But I think we can't add custom listeners directly in PySpark; we have to do it via Scala. Refer to this.

You can refer to this page to see how to implement a SpillListener in Scala.
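For reference, here is a minimal Scala sketch of such a listener. The class name SpillAlertListener, the per-stage bookkeeping, and the println notification are illustrative choices, not part of Spark's API; the listener callbacks and the memoryBytesSpilled/diskBytesSpilled task metrics are standard.

import org.apache.spark.executor.TaskMetrics
import org.apache.spark.scheduler._
import scala.collection.mutable

// Sums spill bytes per stage from task metrics and reports when the stage completes
class SpillAlertListener extends SparkListener {
  private val stageSpills = mutable.Map[Int, Long]().withDefaultValue(0L)

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val m: TaskMetrics = taskEnd.taskMetrics
    if (m != null) {
      stageSpills(taskEnd.stageId) += m.memoryBytesSpilled + m.diskBytesSpilled
    }
  }

  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit = {
    val stageId = stageCompleted.stageInfo.stageId
    val spilled = stageSpills.remove(stageId).getOrElse(0L)
    if (spilled > 0) {
      // Replace println with whatever notification mechanism you use (log, alert, metric)
      println(s"Stage $stageId spilled $spilled bytes")
    }
  }
}

You would compile this into a JAR, attach it to the cluster, and register it either by calling sc.addSparkListener from a Scala notebook cell or by setting spark.extraListeners to the fully qualified class name in the cluster's Spark config.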

Mohana B C