
I am using a Horizontal Pod Autoscaler (HPA) in AKS (I will show this file below). My containers run a Flask API server that handles POST requests. I used this line to run Flask threaded:

if __name__ == "__main__":
    app.run(host='0.0.0.0', port=5003, threaded=True)

I make 20 calls to my Flask app running locally and it is able to handle them, albeit very slowly. I make 20 calls to my AKS deployment: the first time (when there is only 1 pod running) it gives me error responses; the second time, I get 20 responses without any errors (the number of pods has increased).

Now I am trying to figure out why it does not wait for an old pod to become available or for a new pod to be created. I thought there was a part of AKS that would do that.

Please let me know if I am missing something!

Deployment.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: *hidden*
spec:
  selector:
    matchLabels:
      app: *hidden*
  template:
    metadata:    
      labels:
        app: *hidden*
    spec:
      containers:
      - name: *hidden*
        image: *hidden*
        env:
        - name: *hidden*
          valueFrom:
            secretKeyRef:
              name: *hidden*
              key: *hidden*
        imagePullPolicy: Always
        resources:
          requests:
            cpu: "300m"
            memory: "400Mi"
          limits:
            cpu: "300m"
            memory: "400Mi"
        ports:
        - containerPort: 5003

      imagePullSecrets:
      - name: *hidden*
---

apiVersion: v1
kind: Service
metadata:
  name: *hidden*
spec:
  selector:
    app: *hidden*
  ports:
  - port: 5003
    protocol: TCP
    targetPort: 5003
  type: LoadBalancer

hpa.yaml:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: *hidden*
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: *hidden*
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 20
  behavior:
    scaleUp:
      policies:
      - type: Pods
        value: 20
        periodSeconds: 60
    scaleDown:
      policies:
      - type: Pods
        value: 4
        periodSeconds: 60
  

1 Answer


From your description, it seems like your Flask API server is not able to handle the load of 20 requests at once. This can be due to insufficient CPU and memory resources allocated to your containers in the AKS cluster.

When you send 20 requests at once, the first pod might get overwhelmed with requests and start responding with error messages. However, when the HPA kicks in and scales up the number of pods, the load is distributed among multiple pods, allowing them to handle the requests without any errors.
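To make that scale-up behaviour concrete, the HPA computes desiredReplicas = ceil(currentReplicas × currentUtilization / targetUtilization). Below is a minimal sketch of that calculation in Python; the observed utilization figure is assumed purely for illustration and is not taken from your cluster:

import math

# Sketch of the HPA scale-up calculation. The 80% figure is an assumed
# observed CPU utilization (as a percentage of the 300m request); the 20%
# target comes from the hpa.yaml above.
current_replicas = 1
current_utilization = 80
target_utilization = 20

desired_replicas = math.ceil(current_replicas * current_utilization / target_utilization)
print(desired_replicas)  # 4 -> the HPA asks for 4 pods, capped by maxReplicas

Note that this calculation only runs after the metrics pipeline has reported the spike, so the very first burst of requests still lands entirely on the single existing pod.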

By increasing the resource requests and limits in your deployment configuration, you can ensure that each pod has sufficient resources allocated to handle the expected load. This can help avoid errors due to resource exhaustion and provide a better experience for your users.

In your deployment configuration, you have set the resource requests and limits to "300m" CPU and "400Mi" memory. You may need to increase these values based on the load that your Flask API server is expected to handle. You can also use tools like Kubernetes Dashboard or Prometheus to monitor the resource usage of your containers and adjust the resource limits accordingly.

Try increasing the CPU and memory in the requests and limits section based on your findings, for example:

resources:
  limits:
    cpu: "500m"
    memory: "512Mi"
  requests:
    cpu: "250m"
    memory: "256Mi"

More on requests and limits here

I would also suggest using a readiness probe. A readiness probe can help in such cases by ensuring that your application is fully available before it receives any traffic. In the case of your Flask application running in Kubernetes, a readiness probe can verify that the application has fully started up and is ready to receive traffic. This helps avoid situations where the application is not yet available when the service starts sending it requests, which can result in errors or delays for clients.

To configure the readiness probe, you can add the following section to your deployment spec:

spec:
  containers:
  - name: <container-name>
    readinessProbe:
      httpGet:
        path: /<health-check-endpoint>
        port: <container-port>
      initialDelaySeconds: 10
      periodSeconds: 5

Replace <container-name> with the name of your container, <health-check-endpoint> with the endpoint that your Flask application exposes to check its health status, and <container-port> with the port that your Flask application listens on.

For example, if your Flask application exposes a health check endpoint at /health and listens on port 5003, the readiness probe configuration would look like this:

spec:
  containers:
  - name: *hidden*
    image: *hidden*
    ...
    readinessProbe:
      httpGet:
        path: /health
        port: 5003
      initialDelaySeconds: 10
      periodSeconds: 5

This will ensure that Kubernetes only sends traffic to the container when it is actually ready to handle it.

You can get a good understanding of readiness probes here

Just for more clarity: in the case of a Flask app, you can use a dedicated route to implement the readiness probe endpoint. Here's an example:

from flask import Flask, jsonify

app = Flask(__name__)

@app.route('/healthz')
def healthz():
    return jsonify({'status': 'ok'})

@app.route('/api')
def api():
    # your API logic here
    return jsonify({'result': 'success'})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)

In this example, we've added a new route '/healthz' that returns a JSON response indicating that the app is healthy. You can use this route as the target of your readiness probe. This Flask code is just for illustration and might contain errors.

To specify the readiness probe in your deployment, you can add the following to your container spec:

readinessProbe:
  httpGet:
    path: /healthz
    port: 5000
  initialDelaySeconds: 10
  periodSeconds: 5

This specifies that the readiness probe should perform an HTTP GET request to the '/healthz' endpoint on port 5000, with an initial delay of 10 seconds and a period of 5 seconds between probes.

Once you've added this to your deployment, Kubernetes will use the readiness probe to determine when your containers are ready to receive traffic. If the readiness probe fails, Kubernetes will stop sending traffic to that container until it becomes ready again.

Shane Warne
  • Thank you so much for the detailed response! Regarding the first part of your response, is it correct to say for example: “1 pod is to be able to handle 10 responses, if you put the hpa at a 50% a new pod will be made when there are 5 responses. If I receive another 6 requests before that other pod is ready, and before the first pod has any available space, 1 request will receive an error response.”? (...) – Jens Voorpyl Apr 21 '23 at 07:38
  • (...) At this point in time my pod can only handle 1 (maybe 2 if the request is small) request at the same time, so if I send 20 requests at the same time, it is to be expected that 19 or 18 will receive an error response? I thought that AKS would just put the requests on hold while hpa made a new pod, but I suppose that is not the case? (if this is the case, is there any way to increase this ‘hold-time’?) I will look into the readiness probe as well! – Jens Voorpyl Apr 21 '23 at 07:38
  • Your pods are scaling based on CPU metrics, not requests, and from what I see the target is 20%; if the CPU utilization goes above 20% it will scale up. Did you try increasing the requests and limits for CPU and RAM for the container? – Shane Warne Apr 21 '23 at 09:28
  • In your example, if one pod can only handle 10 requests and you have set the HPA to scale at 50%, then a new pod will be created after 5 requests have been served by the first pod. If you receive 6 more requests before the new pod is ready, then one request will fail as the first pod has reached its capacity limit. – Shane Warne Apr 21 '23 at 09:29
  • You can fine-tune the HPA configuration and set the target CPU and memory utilization to a level that can handle the expected workload without creating too many pods. – Shane Warne Apr 21 '23 at 09:29
  • You can also achieve this by using a load balancer in front of your service. When a request comes in, the load balancer will distribute the request to one of the available pods. If there are no available pods, the load balancer can either queue the request or return an error response to the client. – Shane Warne Apr 21 '23 at 09:41
  • Like a load balancer, the Ingress controller can also queue incoming requests until a pod is available to handle them. – Shane Warne Apr 21 '23 at 09:42
  • This Stack Overflow link, which is very well explained, can help with your use case of limiting requests in flight per pod: https://stackoverflow.com/questions/65598713/k8s-ingress-how-to-limit-requests-in-flight-per-pod – Shane Warne Apr 21 '23 at 09:43
  • I am trying to use NGINX to use its request queue. Thanks for the link! I cannot find a lot of resources on the subject. – Jens Voorpyl Apr 21 '23 at 10:15
  • I am wondering, though, why the queue isn't a standard AKS component. What are you normally supposed to do when you receive more requests than your currently available pods can handle? Just tell your users to try again later? – Jens Voorpyl Apr 21 '23 at 10:16