
I want to track the global failure rates for jobs/tasks/stages across all nodes in the cluster. Currently the idea is to parse the event log files in HDFS used by the history server and obtain this data from them, but this seems clunky. Are there any better approaches? Ideally I would have access to this information per job on the client side, but this doesn't seem to be the case. What is the recommended way to approach this?

1 Answer


One idea is to extend SparkListener, gather metrics around failures, and push them wherever you want (e.g. publish the events to ELK).

Some useful events:

case class SparkListenerExecutorBlacklisted(
    time: Long,
    executorId: String,
    taskFailures: Int)
  extends SparkListenerEvent

case class SparkListenerExecutorBlacklistedForStage(
    time: Long,
    executorId: String,
    taskFailures: Int,
    stageId: Int,
    stageAttemptId: Int)
  extends SparkListenerEvent

case class SparkListenerNodeBlacklistedForStage(
    time: Long,
    hostId: String,
    executorFailures: Int,
    stageId: Int,
    stageAttemptId: Int)
  extends SparkListenerEvent

case class SparkListenerNodeBlacklisted(
    time: Long,
    hostId: String,
    executorFailures: Int)
  extends SparkListenerEvent

And the corresponding callbacks on SparkListener:

def onExecutorBlacklisted(executorBlacklisted: SparkListenerExecutorBlacklisted): Unit
def onExecutorBlacklistedForStage(executorBlacklistedForStage: SparkListenerExecutorBlacklistedForStage): Unit
def onNodeBlacklistedForStage(nodeBlacklistedForStage: SparkListenerNodeBlacklistedForStage): Unit
def onNodeBlacklisted(nodeBlacklisted: SparkListenerNodeBlacklisted): Unit
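
A minimal sketch of such a listener, assuming you only want in-process counters to start with (the class name FailureMetricsListener and the idea of forwarding to an external sink such as ELK are illustrative, not prescribed by Spark):

import org.apache.spark.scheduler._

// Counts failed tasks and blacklisting events. Listener events are delivered
// on a single listener-bus thread, so plain vars are enough for a sketch;
// replace the counters with calls to your metrics backend as needed.
class FailureMetricsListener extends SparkListener {

  private var failedTasks = 0L
  private var blacklistedExecutors = 0L
  private var blacklistedNodes = 0L

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit =
    taskEnd.reason match {
      case org.apache.spark.Success => // task succeeded, nothing to record
      case reason =>
        failedTasks += 1
        // push `reason` and taskEnd.taskInfo to your metrics backend here
    }

  override def onExecutorBlacklisted(
      executorBlacklisted: SparkListenerExecutorBlacklisted): Unit = {
    blacklistedExecutors += 1
    // executorBlacklisted.taskFailures says how many failures triggered it
  }

  override def onNodeBlacklisted(
      nodeBlacklisted: SparkListenerNodeBlacklisted): Unit = {
    blacklistedNodes += 1
  }
}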

Note that you can register the listener via the Spark context's addSparkListener method. More details in this other Stack Overflow thread: How to implement custom job listener/tracker in Spark?
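
For completeness, a short registration sketch (FailureMetricsListener is the illustrative class from above; alternatively you can attach the listener at startup via the spark.extraListeners configuration, which requires a zero-argument constructor and a fully qualified class name):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("failure-tracking").getOrCreate()

// Attach the custom listener to the underlying SparkContext
spark.sparkContext.addSparkListener(new FailureMetricsListener())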

Note: to make it work with PySpark, follow the steps described in this other Stack Overflow thread: How to add a SparkListener from pySpark in Python?

Fabio Manzano