I'm using Pushgateway with Prometheus and everything works, but after a couple of weeks Pushgateway collapses. Looking into it, there are tons of metrics that are no longer used, and deleting them manually is practically impossible.

Is there a way to expire Pushgateway metrics with a TTL, or with some other retention setting such as by size or by time? Or maybe both?

NOTE: On the Prometheus mailing list, a lot of people have been asking for something like this for a year or more, and the only answer so far is that this is not the Promethean way to do it. Really? Come on, if this is a real pain for a lot of people, maybe there should be a better way (even if it's not the Promethean way).

Carlos Saltos
  • Metrics for batch jobs are difficult. The team made the decision not because of the *Promethean* way but because it is hard to justify a feature which would mainly lead to anti-patterns. From a practical point of view I would be happy with a little anti-pattern :) – Michael Doubez Aug 24 '20 at 14:39
  • If you need pushing Prometheus metrics to a centralized storage, then take a look at VictoriaMetrics. It supports metrics ingestion via various protocols, including [Prometheus text exposition format](https://victoriametrics.github.io/#how-to-import-data-in-prometheus-exposition-format). – valyala Apr 08 '21 at 12:59

4 Answers

Supposing you want to remove the metrics of a group once they become too old (for a given definition of too old), you can use the push_time_seconds metric, which Pushgateway defines automatically for every group.

push_time_seconds{instance="foo",job="bar",try="longtime"} 1.598280005888635e+09

With this information, you can write a script that fetches this metric and uses its value to identify stale groups (here {instance="foo",job="bar",try="longtime"}). The API lets you delete all metrics belonging to such a group:

 curl -X DELETE http://pushgateway:9091/metrics/job/bar/instance/foo/try/longtime

This can be done in a few lines of Bash or Python.
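
As a rough illustration of the approach above, here is a minimal Python sketch. The function names (`parse_push_times`, `group_path`, `stale_groups`, `purge`) and the one-week threshold in the usage comment are my own choices for the example, not part of Pushgateway. It assumes that labels with an empty value (such as `instance=""`) are not part of the grouping key, and that label values containing `/` should be sent in the `@base64` path form.

```python
import base64
import re
import time
import urllib.parse
import urllib.request

# Matches lines such as: push_time_seconds{instance="foo",job="bar"} 1.59e+09
PUSH_TIME = re.compile(r'^push_time_seconds\{(.*)\}\s+(\S+)\s*$')
LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def _unescape(value):
    # Undo the exposition-format escapes: \\ , \" and \n
    return re.sub(r'\\(.)', lambda m: "\n" if m.group(1) == "n" else m.group(1), value)


def parse_push_times(metrics_text):
    """Return a list of (labels, push_timestamp) pairs, one per pushed group."""
    groups = []
    for line in metrics_text.splitlines():
        match = PUSH_TIME.match(line)
        if match:
            labels = {k: _unescape(v) for k, v in LABEL.findall(match.group(1))}
            groups.append((labels, float(match.group(2))))
    return groups


def group_path(labels):
    """Build the /metrics/job/... path of a group; the job label must come first."""
    # Assumption: labels with an empty value are not part of the grouping key.
    rest = sorted((k, v) for k, v in labels.items() if k != "job" and v != "")
    parts = ["metrics"]
    for name, value in [("job", labels["job"])] + rest:
        if "/" in value:
            encoded = base64.urlsafe_b64encode(value.encode()).decode()
            parts += [name + "@base64", encoded]
        else:
            parts += [name, urllib.parse.quote(value, safe="")]
    return "/" + "/".join(parts)


def stale_groups(metrics_text, max_age_seconds, now):
    """Return the label sets of groups last pushed more than max_age_seconds ago."""
    return [labels for labels, pushed in parse_push_times(metrics_text)
            if now - pushed > max_age_seconds]


def purge(base_url, max_age_seconds):
    """Fetch /metrics from the Pushgateway and delete every stale group."""
    with urllib.request.urlopen(base_url + "/metrics") as response:
        text = response.read().decode("utf-8")
    for labels in stale_groups(text, max_age_seconds, time.time()):
        request = urllib.request.Request(base_url + group_path(labels), method="DELETE")
        urllib.request.urlopen(request).close()


# Example (hypothetical host and threshold), e.g. run from cron:
# purge("http://pushgateway:9091", max_age_seconds=7 * 24 * 3600)
```

Keeping the parsing and path building in pure functions makes it possible to check the script against a saved copy of the /metrics output before letting it delete anything.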

Michael Doubez

I did not get a positive response from the Prometheus team, so I implemented it myself.

https://github.com/dinumathai/pushgateway

docker run -d -p 9091:9091 dmathai/prom-pushgateway-ttl:latest --metric.timetolive=60s
Dinu Mathai

You can run this as a sidecar container in the Pushgateway pod.

- name: pushgateway-metrics-purger
  image: <image/with/curl>
  command:
  - sh
  - -c
  - |
    # Note: only the job label is read, so this deletes the group whose
    # grouping key is the job alone; groups with extra labels need a longer path.
    max_age_seconds=15  # adjust as per requirement
    while true
    do
      curl -s http://localhost:9091/metrics | \
      grep '^push_time_seconds' | \
      while read -r line
      do
        # Last push time, converted from scientific notation to whole seconds
        last_pushed=$(printf "%.f" "$(echo "$line" | awk '{print $NF}')")
        job_name=$(echo "$line" | \
                awk -F '}' '{print $1}' | \
                grep -o 'job=.*' | \
                cut -f1 -d ',' | \
                cut -f2 -d'=' | \
                tr -d '"')
        std_unix_time_now=$(date +%s)
        interval_seconds=$((std_unix_time_now - last_pushed))
        if [ "$interval_seconds" -gt "$max_age_seconds" ]
        then
          curl -s -X DELETE "http://localhost:9091/metrics/job/$job_name" \
          && echo "$(date), Deleted job group - $job_name"
        else
          echo "$(date), Purge action skipped. Interval not satisfied"
        fi
      done
      sleep 3600
    done

Here is an approach that has worked for many use cases here.

  1. Add a TTL (time-to-live) label to each metric.

  2. Next, periodically run an independent purge script that scans the /metrics endpoint and deletes expired metrics based on push_time_seconds.

Adding the TTL on the publisher side decentralizes the lifetime of each metric and makes the solution dynamic, instead of expiring everything after a fixed interval. Also, my organization didn't want to deviate from the original software (no option for custom Docker images).
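
The expiry check in step 2 could look roughly like the Python sketch below. It is only an illustration under assumptions of mine: the label is called `ttl`, its value is a number of seconds with an optional `s`, `m`, `h` or `d` suffix, and it is pushed as part of the grouping key so that it also appears on the group's push_time_seconds series. None of these conventions come from Pushgateway itself.

```python
import re

TTL_LABEL = "ttl"  # hypothetical label name chosen for this example
UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_ttl(value):
    """Parse '90', '5m', '2h' or '7d' into seconds; return None if malformed."""
    match = re.fullmatch(r"(\d+)([smhd]?)", value.strip())
    if not match:
        return None
    number, unit = match.groups()
    return int(number) * UNITS.get(unit, 1)


def is_expired(labels, pushed_at, now, default_ttl=None):
    """Decide whether a group has outlived the TTL carried in its labels.

    labels     -- label dict taken from the group's push_time_seconds series
    pushed_at  -- value of push_time_seconds (Unix seconds)
    now        -- current Unix time
    default_ttl -- fallback in seconds for groups without a usable TTL label;
                   None means such groups are never expired
    """
    ttl = parse_ttl(labels.get(TTL_LABEL, ""))
    if ttl is None:
        ttl = default_ttl
    if ttl is None:
        return False
    return now - pushed_at > ttl
```

A purge script would call `is_expired` for every push_time_seconds series it finds and send a DELETE for the groups that return True. Treating groups without a TTL as permanent by default is a conservative choice, so publishers that have not adopted the label yet are not affected.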

S2L