
We have a few Spark batch jobs and streaming jobs. The batch jobs run on a Google Cloud VM and the streaming jobs run on a Google Dataproc cluster. It is becoming difficult to manage these jobs, so we want to implement a mechanism to monitor their health. Our basic requirements are to know:

  1. What time a job started and how long it took to process the data.
  2. How many records were affected.
  3. Send an alert if there is any error.
  4. Visualize the above metrics every day and take action if required.

I am not well versed in the Spark domain. I explored Stackdriver Logging on Google Dataproc but did not find the logs for streaming jobs on Dataproc clusters. I know the ELK stack can be used, but I wanted to know the best practices in the Spark ecosystem for this kind of requirement. Thanks.

2 Answers


Google Cloud Dataproc writes logs and pushes metrics to Google Stackdriver, which you can use for monitoring and alerting.

Take a look at the documentation on how to use Dataproc with Stackdriver: https://cloud.google.com/dataproc/docs/guides/stackdriver-monitoring
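
For alerting and daily visualization, you can also read those Dataproc metrics back from the Stackdriver Monitoring API. Below is a minimal sketch using the google-cloud-monitoring Java client from Scala; the project id `my-project` is a placeholder, and the metric type `dataproc.googleapis.com/cluster/job/running_count` is one of the Dataproc metrics described in the guide above (check the guide for the exact metric names available to you).

```scala
import com.google.cloud.monitoring.v3.MetricServiceClient
import com.google.monitoring.v3.{ListTimeSeriesRequest, ProjectName, TimeInterval}
import com.google.protobuf.util.Timestamps

object ListDataprocJobMetrics {
  def main(args: Array[String]): Unit = {
    val client = MetricServiceClient.create()
    try {
      val now = System.currentTimeMillis()
      // Look at the last 24 hours of data points.
      val interval = TimeInterval.newBuilder()
        .setStartTime(Timestamps.fromMillis(now - 24L * 3600 * 1000))
        .setEndTime(Timestamps.fromMillis(now))
        .build()
      val request = ListTimeSeriesRequest.newBuilder()
        .setName(ProjectName.of("my-project").toString) // placeholder project id
        .setFilter("metric.type = \"dataproc.googleapis.com/cluster/job/running_count\"")
        .setInterval(interval)
        .setView(ListTimeSeriesRequest.TimeSeriesView.FULL)
        .build()
      // Print each matching time series; in practice you would feed
      // these into a dashboard or an alerting policy instead.
      client.listTimeSeries(request).iterateAll().forEach(ts => println(ts))
    } finally {
      client.close()
    }
  }
}
```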

Igor Dvorzhak
  • Thanks for your answer. We need to write the fluentd conf for Stackdriver Monitoring to point at the application logs, but on Dataproc we don't find any log file for the application. – Ravi Lohan May 29 '18 at 05:46

Adding to what Igor said.

There are metrics in Stackdriver for basic things like success/failure and duration; however, there is nothing covering #2 (record counts).

You can follow this example to create a SparkListener and then report the metrics to the Stackdriver API directly; a sketch is shown below.
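
As a minimal sketch of that approach (assuming Spark 2.x task metrics; the reporting step is stubbed out with a println, which you would replace with a write to the Stackdriver Monitoring API), a listener that tracks duration and record counts could look like this:

```scala
import java.util.concurrent.atomic.AtomicLong

import org.apache.spark.scheduler.{SparkListener, SparkListenerApplicationEnd,
  SparkListenerApplicationStart, SparkListenerTaskEnd}

// Accumulates record counts from per-task metrics and tracks
// application wall-clock time.
class JobMetricsListener extends SparkListener {
  private val recordsRead    = new AtomicLong(0L)
  private val recordsWritten = new AtomicLong(0L)
  @volatile private var startTime: Long = 0L

  override def onApplicationStart(event: SparkListenerApplicationStart): Unit = {
    startTime = event.time
  }

  override def onTaskEnd(event: SparkListenerTaskEnd): Unit = {
    val m = event.taskMetrics
    if (m != null) { // metrics can be null for failed tasks
      recordsRead.addAndGet(m.inputMetrics.recordsRead)
      recordsWritten.addAndGet(m.outputMetrics.recordsWritten)
    }
  }

  override def onApplicationEnd(event: SparkListenerApplicationEnd): Unit = {
    val durationMs = event.time - startTime
    // Stub: replace with a write to the Stackdriver Monitoring API,
    // e.g. via the google-cloud-monitoring client library.
    println(s"duration=${durationMs}ms read=${recordsRead.get} written=${recordsWritten.get}")
  }
}
```

You can register it with `--conf spark.extraListeners=JobMetricsListener` at submit time or via `sparkContext.addSparkListener(new JobMetricsListener)` in the driver. Note that for streaming jobs `onApplicationEnd` may never fire in practice, so you would flush the counters periodically instead, e.g. from a `StreamingListener.onBatchCompleted` hook.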

tix
  • Thanks for your answer. I will check out the SparkListener. Is there any general practice used for Spark job monitoring? – Ravi Lohan May 29 '18 at 05:48