Several of our applications have batch jobs that aggregate data every night. These batch jobs, which are Python scripts, use Prometheus Pushgateway to push metric values to Prometheus, and we have rules that trigger alerts (in Alertmanager) when these metrics become invalid (e.g. exceed a certain threshold).
We would now also like to use Prometheus metrics to double-check that the batch jobs themselves ran correctly. For example: did the job start on time? Did any errors occur? Did the job run to completion? To this end, we would like to change our Python scripts to push a metric when the script starts, when it finishes, and whenever an error occurs. This does raise some problems, though: we have quite a few batch jobs, and three metrics per batch job means a lot of manual configuration for rules and alerts. We would also like to display the status graphically in Grafana, and we aren't really sure what the right visualization for that would look like.
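To make the idea concrete, here is a rough sketch of the kind of wrapper we have in mind. It is only an illustration, not what we currently run: the metric names (`batch_job_last_start_timestamp_seconds`, `batch_job_last_success_timestamp_seconds`, `batch_job_last_run_errors`), the helper names, and the Pushgateway address are all made up for this example. It uses only the standard library and sends the text exposition format to the Pushgateway's `/metrics/job/<job>` endpoint with POST, so that metrics not included in a push (such as the previous success timestamp) are left untouched. It also assumes job names contain no slashes.

```python
import time
import urllib.parse
import urllib.request

# Assumption: Pushgateway reachable on its default port.
PUSHGATEWAY_URL = "http://localhost:9091"


def render_metrics(start_ts, success_ts=None, errors=0):
    """Render hypothetical job-status gauges in the Prometheus text format."""
    lines = [
        "# TYPE batch_job_last_start_timestamp_seconds gauge",
        f"batch_job_last_start_timestamp_seconds {start_ts}",
        "# TYPE batch_job_last_run_errors gauge",
        f"batch_job_last_run_errors {errors}",
    ]
    if success_ts is not None:
        # Only pushed when the job ran to completion.
        lines += [
            "# TYPE batch_job_last_success_timestamp_seconds gauge",
            f"batch_job_last_success_timestamp_seconds {success_ts}",
        ]
    return "\n".join(lines) + "\n"


def push(job_name, payload, gateway=PUSHGATEWAY_URL):
    """POST the payload to the Pushgateway group for this job."""
    url = f"{gateway}/metrics/job/{urllib.parse.quote(job_name, safe='')}"
    request = urllib.request.Request(
        url,
        data=payload.encode("utf-8"),
        method="POST",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


def run_monitored(job_name, job_fn, push_fn=push, clock=time.time):
    """Run job_fn and push start, error, and success status around it."""
    start = clock()
    push_fn(job_name, render_metrics(start))
    try:
        job_fn()
    except Exception:
        # Record the failure, then let the script fail as it normally would.
        push_fn(job_name, render_metrics(start, errors=1))
        raise
    push_fn(job_name, render_metrics(start, success_ts=clock()))
```

Each script would then call something like `run_monitored("nightly_aggregation", main)`. Because every job pushes the same metric names and differs only in its `job` label, our hope is that a single generic rule could cover all jobs, for example an expression along the lines of `time() - batch_job_last_success_timestamp_seconds > 90000`, but we have not tested whether this works well in practice.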
Has anyone else tackled a similar problem, using Prometheus metrics to monitor the status of several batch jobs? Which metrics did you record, and what did your alerts/rules look like? Did you find an intuitive way to graphically display the status of each batch job?