I'm currently adding some metrics to an existing pipeline that runs on Google Dataproc via the Spark Runner, and I'm trying to work out how to access these metrics and eventually expose them to Stackdriver (to be used downstream in Grafana dashboards).
The metrics themselves are fairly simple (a series of counters) and are defined as follows (and accessed in DoFns throughout the pipeline):
import org.apache.beam.sdk.metrics.Counter

object Metrics {
    // Fully qualified to avoid a clash with this enclosing object's own name
    val exampleMetric: Counter =
        org.apache.beam.sdk.metrics.Metrics.counter(ExamplePipeline::class.qualifiedName, "count")
    // Others omitted for brevity
}
This metric (and others like it) is incremented throughout the pipeline in various DoFn calls, and several unit tests confirm that the MetricQueryResults object returned by the pipeline is properly populated after executing via the DirectRunner.
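For illustration, the incrementing and the test-side querying look roughly like this. The DoFn, the verify function, and the Create-based input are simplified placeholders of my own, not the real transforms:

import org.apache.beam.sdk.Pipeline
import org.apache.beam.sdk.metrics.MetricNameFilter
import org.apache.beam.sdk.metrics.MetricsFilter
import org.apache.beam.sdk.options.PipelineOptionsFactory
import org.apache.beam.sdk.transforms.Create
import org.apache.beam.sdk.transforms.DoFn
import org.apache.beam.sdk.transforms.ParDo

// Simplified placeholder DoFn that bumps the counter for every element it sees.
class ExampleDoFn : DoFn<String, String>() {
    @ProcessElement
    fun processElement(context: ProcessContext) {
        Metrics.exampleMetric.inc()
        context.output(context.element())
    }
}

// Unit-test style check: run on the DirectRunner (the default) and query the counter.
fun verifyExampleMetric() {
    val pipeline = Pipeline.create(PipelineOptionsFactory.create())
    pipeline.apply(Create.of("a", "b", "c")).apply(ParDo.of(ExampleDoFn()))

    val result = pipeline.run().also { it.waitUntilFinish() }
    val queryResults = result.metrics().queryMetrics(
        MetricsFilter.builder()
            .addNameFilter(MetricNameFilter.named(ExamplePipeline::class.qualifiedName, "count"))
            .build()
    )
    queryResults.counters.forEach { counter -> println("${counter.name} = ${counter.attempted}") }
}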
The primary issue is that I see no indication in Dataproc or any of the related UIs exposed in GCP (YARN ResourceManager, Spark History Server, YARN Application Timeline, etc.) that these metrics are being emitted. I've scoured the logs and everywhere else I can think of, but I don't see any sign of these custom metrics (or, really, of any metrics at all being emitted from Spark and/or into Stackdriver).
Job Configuration
The Spark job itself is submitted via the following command in a script (assuming the appropriate .jar file has already been copied to the proper bucket in GCP):
gcloud dataproc jobs submit spark --jar $bucket/deployment/example-pipeline.jar \
--project $project_name \
--cluster $cluster_name \
--region $region \
--id pipeline-$timestamp \
--driver-log-levels $lots_of_things_here \
--properties=spark.dynamicAllocation.enabled=false \
--labels="type"="example-pipeline","namespace"="$namespace" \
--async \
-- \
--runner=SparkRunner \
--streaming
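Everything after the bare -- is passed through to the pipeline as program arguments. For reference, they are consumed in the job's main() roughly as follows; the actual entry point differs, but the options plumbing is standard Beam:

import org.apache.beam.runners.spark.SparkPipelineOptions
import org.apache.beam.sdk.Pipeline
import org.apache.beam.sdk.options.PipelineOptionsFactory

fun main(args: Array<String>) {
    // Picks up --runner=SparkRunner and --streaming from the arguments after "--".
    val options = PipelineOptionsFactory.fromArgs(*args)
        .withValidation()
        .`as`(SparkPipelineOptions::class.java)
    val pipeline = Pipeline.create(options)
    // ... transforms (including the DoFns that increment the counters) go here ...
    pipeline.run()
}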
Cluster Configuration
The cluster itself appears to have every metrics-related property I could think of enabled, such as:
dataproc:dataproc.logging.stackdriver.enable=true
dataproc:dataproc.logging.stackdriver.job.driver.enable=true
dataproc:dataproc.monitoring.stackdriver.enable=true
dataproc:spark.submit.deployMode=cluster
spark:spark.eventLog.dir=hdfs:///var/log/spark/apps
spark:spark.eventLog.enabled=true
yarn:yarn.log-aggregation-enable=true
yarn:yarn.log-aggregation.retain-seconds=-1
Those are just a few of the properties set on the cluster; there are plenty of others, so if one appears to be missing or incorrect (as far as metrics are concerned), feel free to ask.
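For completeness, these properties were applied at cluster creation time, along the lines of the command below; the variable names are placeholders and the real script passes a much longer property list:

gcloud dataproc clusters create $cluster_name \
    --region $region \
    --properties="dataproc:dataproc.monitoring.stackdriver.enable=true,spark:spark.eventLog.enabled=true,yarn:yarn.log-aggregation-enable=true"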
Questions
- It doesn't seem as though these metrics are automatically emitted or visible from Spark (or in Stackdriver). Is there some missing configuration at the cluster/job level, or something that needs to be wired up through the MetricsOptions interface (see the sketch below for what I have in mind)?
- Once metrics are actually being emitted, I'd assume Stackdriver has a mechanism for consuming them from Dataproc (which sounded like what dataproc:dataproc.monitoring.stackdriver.enable=true would handle). Is that the case?
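To make the first question concrete, this is roughly the kind of wiring I imagine MetricsOptions would involve, where options are the pipeline options created in main(). MetricsHttpSink, the URL, and the push period here are assumptions on my part, not a configuration I have working:

import org.apache.beam.runners.extensions.metrics.MetricsHttpSink
import org.apache.beam.sdk.metrics.MetricsOptions

// Sketch only: point Beam's metrics push at some HTTP collector (placeholder URL).
val metricsOptions = options.`as`(MetricsOptions::class.java)
metricsOptions.metricsSink = MetricsHttpSink::class.java
metricsOptions.metricsHttpSinkUrl = "http://some-metrics-collector/push" // placeholder endpoint
metricsOptions.metricsPushPeriod = 30L // push period (my guess at the unit: seconds)

Presumably the same settings could also be passed as flags after the -- in the submit command, since they are just PipelineOptions properties, but I haven't confirmed that the Spark runner on Dataproc actually honors them.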
I have to imagine this is a fairly common use case (for Spark / Dataproc / Beam), but I'm unsure which pieces of the configuration puzzle are missing, and documentation/articles on this process seem quite sparse.
Thanks in advance!