
I am trying to use Prometheus to track the number of requests to my server over time. Since my servers will be autoscaled horizontally on Google Compute Engine, I can only push my metrics to a remote Pushgateway. My servers can be deleted and re-created at any time.

The problem is that whenever a new server is created, or even when a new counter instance is created with the Python client library, the count is reset to 0. The graph also goes up and down instead of always increasing.


What is the proper way to track the total number of requests with Prometheus in an autoscaled environment?

EDIT:

There is another post about exactly the same problem in a slightly different scenario: "Prometheus how to handle counters on server". It seems the servers must somehow track the counter state themselves. Prometheus only records whatever values are sent to it at that point, whether pushed or pulled. This means the counter value does not always go up if the servers simply call counter.inc(). In other words, the following statement from the documentation applies only on the client-library side.

A counter is a cumulative metric that represents a single numerical value that only ever goes up.
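
To make the symptom concrete, here is a minimal sketch of why the stored value can drop. It is a toy model, not the Pushgateway API: `ToyGateway`, `observed_series`, and the group name `my_job` are all hypothetical. It only assumes that a push replaces the previously stored value for the same group, and that every new process starts its counter at 0.

```python
class ToyGateway:
    """Hypothetical stand-in for a push target: it keeps the last value per group."""

    def __init__(self):
        self.stored = {}

    def push(self, group, value):
        # A push replaces the stored value; it is never added to the old one.
        self.stored[group] = value


def observed_series(lifetimes):
    """Return the values seen at the gateway after every push.

    `lifetimes` is a list of server lifetimes; each lifetime is a list of
    request counts handled between two consecutive pushes.
    """
    gateway = ToyGateway()
    series = []
    for batches in lifetimes:
        counter = 0  # a fresh process starts with a fresh counter
        for handled in batches:
            counter += handled  # equivalent of counter.inc(handled)
            gateway.push("my_job", counter)
            series.append(gateway.stored["my_job"])
    return series


# Two server lifetimes: the second one starts counting from zero again.
series = observed_series([[3, 2], [1, 4]])
print(series)  # [3, 5, 1, 5]
```

Each process's counter only goes up, but the series seen on the monitoring side drops from 5 to 1 when the second server takes over.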

Cœur
Andy

1 Answer


Since my servers will be autoscaled horizontally using Google Compute Engine, I can only push my metric to the remote push gateway. My servers will be deleted and re-created at any given time.

That's not quite true. You can use service discovery to automatically discover your nodes and have them instrumented and monitored in the usual Prometheus fashion.

The pushgateway is only intended for service-level batch jobs, see https://prometheus.io/docs/practices/pushing/

brian-brazil
  • Since the servers' existence is dynamic, Prometheus might not retrieve the data in time before a server is removed. However, the problem now is that my count value cannot be accumulated across instances and registries. Will this problem be solved if I use pulling instead? The reason for recreating the registry every time is that reusing a registry somehow produces a 500 Server Error in the push gateway at some point. – Andy Aug 04 '16 at 22:20
  • There's plenty of races in monitoring, and to be honest if you're bringing up/down servers so often that you're losing a noticeable amount of samples then you need to tune the hysteresis on your autoscaling to reduce the oscillation. Aggregating is a matter of taking a rate of the counters and then a sum of that. – brian-brazil Aug 05 '16 at 06:57
  • I think you have a point. I can just aggregate the results to get the grand total. Could you put this in the answer so I can accept it? As for the oscillation, I am trying to minimize the machine cost by using small instances. The side effect is it changes quickly with the traffic status. – Andy Aug 10 '16 at 03:29
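
The aggregation suggested in the comments (take a rate of each instance's counter, then sum the rates) would be written in PromQL roughly as `sum(rate(http_requests_total[5m]))`, where the metric name `http_requests_total` is an assumption. The Python sketch below illustrates the idea behind it under simplifying assumptions: any drop in a series is treated as a counter reset, and the window extrapolation that Prometheus applies is ignored. The function names and instance labels are hypothetical.

```python
def increase(samples):
    """Total increase of one counter series, treating any drop as a reset to zero."""
    total = 0.0
    previous = None
    for value in samples:
        if previous is not None:
            if value >= previous:
                total += value - previous
            else:
                # The counter restarted, so everything it shows now is new.
                total += value
        previous = value
    return total


def summed_rate(series_by_instance, window_seconds):
    """Per-second rate across all instances: sum of per-instance increases / window."""
    total_increase = sum(increase(s) for s in series_by_instance.values())
    return total_increase / window_seconds


# Hypothetical scraped samples over a 60-second window.
scraped = {
    "vm-a": [0, 10, 25],    # runs for the whole window
    "vm-b": [0, 5, 2, 8],   # restarted once: 5, then back to 2
}
print(increase(scraped["vm-a"]))  # 25.0
print(increase(scraped["vm-b"]))  # 13.0
print(summed_rate(scraped, 60))
```

Because the rate is computed per instance before summing, a reset on one server does not pull the total down; the grand total of requests is the summed increase over the period of interest.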