We have a few Spark batch jobs and streaming jobs. The batch jobs run on a Google Cloud VM and the streaming jobs run on a Google Dataproc cluster. It is becoming difficult to manage them, so we want to implement some mechanism to monitor the jobs' health. Our basic requirement is to know:
- What time a job started and how long it took to process the data.
- How many records were affected.
- Send an alert if there is any error.
- Visualize the above metrics every day and take action if required (see the sketch after this list).
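The closest thing I have found for the batch jobs is Spark's `SparkListener` API. Below is a minimal sketch of what I am imagining; `JobHealthListener` is just a placeholder name I made up, and the `println` calls stand in for whatever sink we would actually push the metrics to:

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerJobStart, SparkListenerTaskEnd}
import scala.collection.mutable

// Placeholder name; collects job start times, durations and input record counts.
class JobHealthListener extends SparkListener {
  private val jobStartTimes = mutable.Map[Int, Long]()
  private var recordsRead = 0L

  override def onJobStart(jobStart: SparkListenerJobStart): Unit =
    jobStartTimes(jobStart.jobId) = jobStart.time

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit =
    // taskMetrics can be null for some failed tasks, hence the Option.
    Option(taskEnd.taskMetrics).foreach(m => recordsRead += m.inputMetrics.recordsRead)

  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit = {
    val durationMs = jobEnd.time - jobStartTimes.getOrElse(jobEnd.jobId, jobEnd.time)
    // println is a stand-in for whatever sink we settle on (Stackdriver, ELK, ...).
    println(s"Job ${jobEnd.jobId}: took ${durationMs} ms, " +
      s"records read so far = $recordsRead, result = ${jobEnd.jobResult}")
  }
}
```

From what I can tell it could be registered with `spark.sparkContext.addSparkListener(new JobHealthListener)` or via the `spark.extraListeners` config, but I don't know whether this is the standard approach.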
I am not well versed in the Spark domain. I explored Stackdriver logging in Google Dataproc but did not find the logs for the streaming jobs on the Dataproc clusters. I know the ELK stack can be used, but I wanted to know what the best practice in the Spark ecosystem is for this kind of requirement. Thanks.
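For the streaming jobs, assuming they use Structured Streaming (ours may differ), I was considering a `StreamingQueryListener` that logs each micro-batch's progress so the metrics at least show up in the driver logs, and hopefully from there in Stackdriver on Dataproc. Again, this is only a sketch and `StreamingHealthListener` is a made-up name:

```scala
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

// Placeholder name; logs per-batch progress and failures for streaming queries.
class StreamingHealthListener extends StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit =
    println(s"Query ${event.id} started at ${java.time.Instant.now}")

  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    val p = event.progress
    println(s"Query ${p.id} batch ${p.batchId}: ${p.numInputRows} input rows, " +
      s"durations(ms) = ${p.durationMs}")
  }

  override def onQueryTerminated(event: QueryTerminatedEvent): Unit =
    // This is where I would hook in the error alerting.
    event.exception.foreach(e => println(s"Query ${event.id} failed: $e"))
}

// Registered once per streaming application:
// spark.streams.addListener(new StreamingHealthListener)
```

I have not actually run this on Dataproc yet, so I don't know whether these log lines would land in Stackdriver, or whether there is a better mechanism I am missing.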