
How do I make sure that one container only tries to handle one request? I am running a Flask API server in my container, but it is not designed to handle multiple requests at the same time.

Right now it seems like multiple requests are being routed to the same pod/container, as I keep getting an OOMKilled status.

Note that this only happens when I send requests in quick succession, e.g. 3 requests with 3 seconds in between.

Note that I am not 100% sure that this is happening; I find it difficult to determine where the requests are going in the AKS cluster. If you have any advice on how to monitor this, I would greatly appreciate it!

I tried setting the resource request and resource limit to the same value in the deployment.yaml, like this:

    resources:
      requests:
        cpu: "100m"
        memory: "128Mi"
      limits:
        cpu: "100m"
        memory: "128Mi"

This is not my preferred way to solve the problem, as most of the time my program only needs 32Mi of memory and the 128Mi is rarely needed.
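
What I would prefer is something closer to this, with the request set to the 32Mi my program usually needs and the limit kept at 128Mi for the occasional spike (as I understand it, Kubernetes allows requests below limits; the pod then falls into the Burstable QoS class instead of Guaranteed):

    resources:
      requests:
        cpu: "100m"
        memory: "32Mi"    # what the program needs most of the time
      limits:
        cpu: "100m"
        memory: "128Mi"   # headroom for the occasional spike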

  • The behavior you describe seems normal. A Web server (or any kind of server) will almost always expect to handle concurrent requests, and nothing in Docker or Kubernetes has the ability to serialize requests like you're describing. On modern hardware 128 MiB isn't a whole lot of memory; if memory is your constraint, can you increase this to, say, 1 GiB? – David Maze Apr 19 '23 at 10:15
  • I increased it to 300m CPU and 400Mi memory. Now I don't get the OOMKilled status anymore! What I do run into is: "0/2 nodes are available: 2 Insufficient cpu. preemption: 0/2 nodes are available: 2 No preemption victims found for incoming pod." – Jens Voorpyl Apr 19 '23 at 11:26

1 Answer


It is not designed to handle multiple requests at the same time

Well, then the code is not designed properly. There are limits to throwing more servers at a code problem.

If I were you, here is what I would do:

  • Fix the code to handle several requests. Maybe you have a memory leak.
  • Increase the memory (double it, and see if it helps)
  • Monitor your app with something like grafana.com to understand why memory is increasing (for a quick first look, the kubectl sketch below is enough).
  • Increase concurrency (see the gunicorn example below).
  • Create an HPA (Horizontal Pod Autoscaler) based on memory: when memory crosses a certain threshold, it will increase your pod count (see the HPA sketch below).
  • Add a readiness probe and configure it so that if the pod doesn't answer, the load balancer won't send requests to that pod (see the probe sketch below).
  • If you really need to process only one request at a time, use a queue: the API puts items in a queue when it receives requests, and a worker processes them one by one (see the queue sketch below).
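
On the monitoring point, plain kubectl already answers "which pod got the request, and why was it killed" before you reach for Grafana. A quick sketch (<pod-name> is a placeholder):

    kubectl get pods -o wide              # pod names and the nodes they run on
    kubectl describe pod <pod-name>       # look for "Last State: Terminated, Reason: OOMKilled"
    kubectl top pod                       # live CPU/memory usage (requires metrics-server)
    kubectl logs <pod-name> --previous    # logs from the container that was killed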
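
For the concurrency point, one common approach is to run Flask under a WSGI server such as gunicorn instead of app.run (assuming your Flask instance is named app in app.py; the worker and thread counts are example values to tune):

    gunicorn --workers 2 --threads 4 --bind 0.0.0.0:5003 app:app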
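
A minimal sketch of the memory-based HPA (the Deployment name flask-api is a placeholder; utilization is measured against the pod's memory request):

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: flask-api-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: flask-api          # placeholder: your Deployment's name
      minReplicas: 1
      maxReplicas: 5
      metrics:
        - type: Resource
          resource:
            name: memory
            target:
              type: Utilization
              averageUtilization: 80   # scale out above 80% of the memory request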
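
And a sketch of the readiness probe, added to the container spec in the deployment.yaml (the /healthz path is a placeholder for whatever lightweight endpoint your app serves; port 5003 matches the Flask app):

    readinessProbe:
      httpGet:
        path: /healthz          # placeholder: any cheap endpoint your app exposes
        port: 5003
      initialDelaySeconds: 5
      periodSeconds: 10
      failureThreshold: 3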
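
For the queue option, a minimal in-process sketch (process_item is a hypothetical stand-in for the memory-heavy work; a production setup would more likely use an external broker so the API and the worker can scale independently):

    import queue
    import threading

    from flask import Flask, jsonify, request

    app = Flask(__name__)
    jobs = queue.Queue()

    def process_item(payload):
        ...  # hypothetical: the actual memory-heavy work goes here

    def worker():
        while True:
            payload = jobs.get()       # blocks until an item arrives
            try:
                process_item(payload)  # strictly one item at a time
            finally:
                jobs.task_done()

    # one worker thread == one request processed at a time
    threading.Thread(target=worker, daemon=True).start()

    @app.route("/submit", methods=["POST"])
    def submit():
        jobs.put(request.get_json())          # accept immediately...
        return jsonify(status="queued"), 202  # ...process later, serially
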
François
  • I am using a Flask API; it is by default not able to handle multiple requests, correct? My Flask version is 2.2.3. I don't understand how a memory leak would change anything about this. (Thanks for commenting!!) – Jens Voorpyl Apr 19 '23 at 11:23
  • Update: I added threaded=True to app.run(host='0.0.0.0', port=5003, threaded=True) and am testing whether this works. – Jens Voorpyl Apr 19 '23 at 12:14
  • Check this out: https://stackoverflow.com/questions/10938360/how-many-concurrent-requests-does-a-single-flask-process-receive – François Apr 19 '23 at 12:47
  • Yes, that is where I got the app.run(host="your.host", port=4321, threaded=True) from. I am not sure what else it could help me with? – Jens Voorpyl Apr 19 '23 at 12:58
  • Your problem seems more like a python problem than a K8s one and I think all the answers you need are in the other SO thread :) – François Apr 19 '23 at 21:28
  • The OOMKilled problem is solved by increasing the assigned memory in the deployment.yaml file. The 1pod/1request problem is solved by adding the threaded=True option. I am not sure what you mean by "python problem", I read the thread you sent but I am not sure I understand what you mean... Thanks for taking the time to respond!! – Jens Voorpyl Apr 20 '23 at 07:29