
I have the below Horizontal Pod Autoscaler configuration on Google Kubernetes Engine to scale a deployment by a custom metric: the RabbitMQ messages-ready count for a specific queue, foo-queue.

It picks up the metric value correctly.

When I insert 2 messages, it scales the deployment to the maximum of 10 replicas. I expect it to scale to 2 replicas, since targetValue is 1 and there are 2 messages ready.

Why does it scale so aggressively?

HPA configuration:

apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: foo-hpa
  namespace: development
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: foo
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: External
    external:
      metricName: "custom.googleapis.com|rabbitmq_queue_messages_ready"
      metricSelector:
        matchLabels:
          metric.labels.queue: foo-queue
      targetValue: 1
Erez Ben Harush
  • Are you sure about `targetValue: 1`? Why is this value so small? I saw samples with recommended values above 100 – Yasen Sep 10 '19 at 12:40
  • @Yasen When setting `targetValue: 100` with 2 messages in the queue, the HPA scales to 2 pods. It seems very aggressive; I expect it to be 1 replica – Erez Ben Harush Sep 10 '19 at 14:43
  • Would you please read this guide by former Docker developer Jérôme Petazzoni: [Kubernetes Deployments: The Ultimate Guide - Semaphore](https://semaphoreci.com/blog/kubernetes-deployment). It explains why in `k8s` there are two replicas and not one as in `docker` – Yasen Sep 10 '19 at 16:02

4 Answers

I think you did a great job explaining how targetValue works with HorizontalPodAutoscalers. However, based on your question, I think you're looking for targetAverageValue instead of targetValue.

In the Kubernetes docs on HPAs, it mentions that using targetAverageValue instructs Kubernetes to scale pods based on the average metric exposed by all Pods under the autoscaler. While the docs aren't explicit about it, an external metric (like the number of jobs waiting in a message queue) counts as a single data point. By scaling on an external metric with targetAverageValue, you can create an autoscaler that scales the number of Pods to match a ratio of Pods to jobs.

Back to your example:

apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: foo-hpa
  namespace: development
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: foo
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: External
    external:
      metricName: "custom.googleapis.com|rabbitmq_queue_messages_ready"
      metricSelector:
        matchLabels:
          metric.labels.queue: foo-queue
      # Aim for one Pod per message in the queue
      targetAverageValue: 1

will cause the HPA to try keeping one Pod around for every message in your queue (with a max of 10 pods).

As an aside, targeting one Pod per message is probably going to cause you to start and stop Pods constantly. If you end up starting a ton of Pods and process all of the messages in the queue, Kubernetes will scale your Pods down to 1. Depending on how long it takes to start your Pods and how long it takes to process your messages, you may have lower average message latency by specifying a higher targetAverageValue. Ideally, given a constant amount of traffic, you should aim to have a constant number of Pods processing messages (which requires you to process messages at about the same rate that they are enqueued).
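
To see why targetAverageValue behaves this way with an external metric, here is a minimal sketch of the arithmetic (the function name and the alternative targetAverageValue of 5 are hypothetical, and it ignores the HPA's tolerance and stabilization behaviour):

import math

def desired_replicas_avg(queue_depth, target_average_value, min_replicas=1, max_replicas=10):
    # With targetAverageValue on an external metric, the single metric sample is
    # divided across the current pods, so the steady state works out to
    # ceil(metric / targetAverageValue), clamped to the min/max bounds.
    desired = math.ceil(queue_depth / target_average_value)
    return max(min_replicas, min(max_replicas, desired))

print(desired_replicas_avg(2, 1))   # 2 -> one Pod per ready message
print(desired_replicas_avg(2, 5))   # 1 -> a higher targetAverageValue absorbs small backlogs
print(desired_replicas_avg(37, 5))  # 8 -> roughly one Pod per 5 messages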

supersam654

According to https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/

From the most basic perspective, the Horizontal Pod Autoscaler controller operates on the ratio between desired metric value and current metric value:

desiredReplicas = ceil[currentReplicas * ( currentMetricValue / desiredMetricValue )]

From the above I understand that as long as the queue has messages, the k8s HPA will continue to scale up, since currentReplicas is part of the desiredReplicas calculation.

For example if:

currentReplicas = 1

currentMetricValue / desiredMetricValue = 2/1

then:

desiredReplicas = 2

If the metric stays the same, then in the next HPA cycle currentReplicas will become 2 and desiredReplicas will be raised to 4.
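
A rough simulation of that feedback loop makes the runaway visible (it just replays the formula quoted above with the queue stuck at 2 messages, ignoring the HPA's tolerance and sync period):

import math

def simulate_hpa(current_metric, desired_metric, current_replicas=1, max_replicas=10, cycles=5):
    # Replays the ratio formula; because the external metric does not shrink as
    # pods are added, each cycle multiplies the replica count until maxReplicas.
    for cycle in range(1, cycles + 1):
        desired = math.ceil(current_replicas * (current_metric / desired_metric))
        current_replicas = min(desired, max_replicas)
        print(f"cycle {cycle}: desiredReplicas={desired}, running={current_replicas}")

simulate_hpa(current_metric=2, desired_metric=1)
# cycle 1: desiredReplicas=2, running=2
# cycle 2: desiredReplicas=4, running=4
# cycle 3: desiredReplicas=8, running=8
# cycle 4: desiredReplicas=16, running=10
# cycle 5: desiredReplicas=20, running=10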

Erez Ben Harush
  • This is exactly it. The HPA is constantly trying to scale up to bring your metric down to the target value, and it can't, because there is no ratio between the number of pods and the number of messages in the queue. This is a common pitfall of custom metrics for HPA – Patrick W Sep 12 '19 at 13:18

Try following this guide, which describes horizontal autoscaling settings for RabbitMQ in k8s:

Kubernetes Workers Autoscaling based on RabbitMQ queue size

In particular, targetValue: 20 for the metric rabbitmq_queue_messages_ready is recommended instead of targetValue: 1:

apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: workers-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1beta1
    kind: Deployment
    name: my-workers
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: External
    external:
      metricName: "custom.googleapis.com|rabbitmq_queue_messages_ready"
      metricSelector:
        matchLabels:
          metric.labels.queue: myqueue
      targetValue: 20

Now the deployment my-workers will grow if the RabbitMQ queue myqueue has more than 20 unprocessed messages in total.
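
Applying the ratio formula from the answer above, a targetValue of 20 means a couple of ready messages no longer trigger a scale-up (the numbers below are only illustrative):

import math

def desired_replicas(current_replicas, queue_depth, target_value, max_replicas=10):
    # Same ratio formula; a larger targetValue keeps the ratio small for modest
    # backlogs, so 2 ready messages stay on a single worker.
    return min(max_replicas, math.ceil(current_replicas * (queue_depth / target_value)))

print(desired_replicas(1, 2, 20))   # 1 -> no scale-up for a 2-message backlog
print(desired_replicas(1, 45, 20))  # 3 -> scale-up starts once the backlog is well past 20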

Yasen
  • The problem is the metric won't change based on the number of pods: with 1 pod there is 1 message in the queue, with 20 pods there is still 1 message in the queue. The HPA is trying to scale up the number of pods to reduce the current metric. – Patrick W Sep 12 '19 at 13:17

I'm using the same Prometheus metrics from RabbitMQ (I'm using Celery with RabbitMQ as the broker).

Has anyone here considered using the rabbitmq_queue_messages_unacked metric rather than rabbitmq_queue_messages_ready?

The thing is that rabbitmq_queue_messages_ready decreases as soon as a message is pulled by a worker, and I'm afraid that a long-running task might be killed by the HPA, while rabbitmq_queue_messages_unacked stays until the task is completed.

For example, I have a message that triggers a new pod (celery-worker) to run a task that takes 30 minutes. rabbitmq_queue_messages_ready will decrease while the pod is running, and the HPA cooldown/delay will terminate the pod.

EDIT: it seems a third metric, rabbitmq_queue_messages, is the right one: it is the sum of both unacked and ready:

sum of ready and unacknowledged messages - total queue depth

documentation
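
To make the difference concrete, here is a small illustrative sketch reusing the one-Pod-per-message arithmetic from the earlier answers (the numbers are made up, and minReplicas still floors the result):

import math

def desired(metric_value, target_average_value=1, min_replicas=1, max_replicas=10):
    # Same averaging arithmetic as in the earlier answers, floored at minReplicas.
    return max(min_replicas, min(max_replicas, math.ceil(metric_value / target_average_value)))

# Three long-running tasks in flight on three workers, nothing else waiting:
ready, unacked = 0, 3
print(desired(ready))            # 1 -> scaling on "ready" asks the HPA to shrink back to one pod mid-task
print(desired(ready + unacked))  # 3 -> total queue depth keeps a pod per in-flight message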

ItayB
  • Interesting point. It raises the question of what the autoscaling strategy should be for pods that run tasks for 30 minutes. I guess it really depends on the business need. – Erez Ben Harush Dec 08 '20 at 19:47