Task for fargate service terminating within seconds

Question

I have an ECS cluster on AWS with four services running in it. One of the services is a replica-type service using the Fargate launch type, with a load balancer attached. The OS is Linux 1.4, and it runs 2 tasks with no auto scaling. The Docker image is a gunicorn application that serves an API built on Falcon; the command used to start it is below.

["gunicorn","-b","0.0.0.0:80","src.app:run()","-k","gevent","--workers=5"]

For some reason the tasks are stopped every few seconds. The task reports Exit Code 1, and the following errors are logged in CloudWatch.

gunicorn.errors.HaltServer: <HaltServer 'Worker failed to boot.' 3>
raise HaltServer(reason, self.WORKER_BOOT_ERROR)
[10] [ERROR] Exception in worker process

This service has been running for the past year without this error, and it suddenly stopped working. No new code was deployed and no development was done, so these errors are very unexpected. The service is configured to run two tasks: it starts two, they stop within 2 seconds, and another two start once the previous ones have stopped. This cycle continues. I have tried redeploying the existing code base but still get the same error. I have also updated the service with new task definitions, but that did not fix it either.
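Since the code did not change, one thing worth comparing is the runtime inside the container. The following is a minimal sketch, not something taken from the original service: the function name `runtime_report` is hypothetical, and the package names are simply the ones that appear in the logs. It records the interpreter version and the installed package versions so a failing image can be compared against a known-good one.

```python
import sys
from importlib import metadata


def runtime_report(packages):
    """Map 'python' and each package name to its installed version.

    A package that is not installed is reported as None.
    """
    report = {"python": sys.version.split()[0]}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report


if __name__ == "__main__":
    # Package names assumed from the log output in this question.
    for name, version in runtime_report(["gunicorn", "gevent", "falcon"]).items():
        print(f"{name}: {version}")
```

Running this inside the image (for example as the container command instead of gunicorn) would show whether the Python version or any dependency differs from what the service was originally built with.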

Some additional errors from CloudWatch, though they do not help much.

[INFO] Starting gunicorn 19.9.0
[INFO] Listening at: http://0.0.0.0:80 (1)
[INFO] Using worker: gevent
/usr/local/lib/python3.10/os.py:1029: RuntimeWarning: line buffering (buffering=1) isn't supported in binary mode, the default buffer size will be used
return io.open(fd, mode, buffering, encoding, *args, **kwargs)
[7] [INFO] Booting worker with pid: 7
[8] [INFO] Booting worker with pid: 8
[9] [INFO] Booting worker with pid: 9
[10] [INFO] Booting worker with pid: 10
[11] [INFO] Booting worker with pid: 11
[7] [ERROR] Exception in worker process
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/gunicorn/arbiter.py", line 583, in spawn_worker
worker.init_process()
File "/usr/local/lib/python3.10/site-packages/gunicorn/workers/ggevent.py", line 203, in init_process
super(GeventWorker, self).init_process()

I have tried running the same Docker image locally and get almost the same error. So the issue is at least narrowed down to the code itself, but I still do not understand why it ran for years and failed only now. The detailed error is below.

[2022-09-04 08:16:27 +0000] [1] [INFO] Starting gunicorn 19.9.0
[2022-09-04 08:16:27 +0000] [1] [INFO] Listening at: http://0.0.0.0:80 (1)
[2022-09-04 08:16:27 +0000] [1] [INFO] Using worker: gevent
/usr/local/lib/python3.10/os.py:1029: RuntimeWarning: line buffering (buffering=1) isn't supported in binary mode, the default buffer size will be used
  return io.open(fd, mode, buffering, encoding, *args, **kwargs)
[2022-09-04 08:16:27 +0000] [7] [INFO] Booting worker with pid: 7
[2022-09-04 08:16:27 +0000] [8] [INFO] Booting worker with pid: 8
[2022-09-04 08:16:27 +0000] [9] [INFO] Booting worker with pid: 9
[2022-09-04 08:16:27 +0000] [7] [ERROR] Exception in worker process
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/gunicorn/arbiter.py", line 583, in spawn_worker
    worker.init_process()
  File "/usr/local/lib/python3.10/site-packages/gunicorn/workers/ggevent.py", line 203, in init_process
    super(GeventWorker, self).init_process()
  File "/usr/local/lib/python3.10/site-packages/gunicorn/workers/base.py", line 129, in init_process
    self.load_wsgi()
  File "/usr/local/lib/python3.10/site-packages/gunicorn/workers/base.py", line 138, in load_wsgi
    self.wsgi = self.app.wsgi()
  File "/usr/local/lib/python3.10/site-packages/gunicorn/app/base.py", line 67, in wsgi
    self.callable = self.load()
  File "/usr/local/lib/python3.10/site-packages/gunicorn/app/wsgiapp.py", line 52, in load
    return self.load_wsgiapp()
  File "/usr/local/lib/python3.10/site-packages/gunicorn/app/wsgiapp.py", line 41, in load_wsgiapp
    return util.import_app(self.app_uri)
  File "/usr/local/lib/python3.10/site-packages/gunicorn/util.py", line 350, in import_app
    __import__(module)
  File "/src/app.py", line 1, in <module>
    import falcon
  File "/usr/local/lib/python3.10/site-packages/falcon/__init__.py", line 30, in <module>
    from falcon.api import API  # NOQA
  File "/usr/local/lib/python3.10/site-packages/falcon/api.py", line 21, in <module>
    from falcon import api_helpers as helpers, DEFAULT_MEDIA_TYPE, routing
  File "/usr/local/lib/python3.10/site-packages/falcon/api_helpers.py", line 21, in <module>
    from falcon import util
  File "/usr/local/lib/python3.10/site-packages/falcon/util/__init__.py", line 29, in <module>
    from falcon.util import structures
  File "/usr/local/lib/python3.10/site-packages/falcon/util/structures.py", line 35, in <module>
    class CaseInsensitiveDict(collections.MutableMapping):  # pragma: no cover
AttributeError: module 'collections' has no attribute 'MutableMapping'
[2022-09-04 08:16:27 +0000] [7] [INFO] Worker exiting (pid: 7)
[2022-09-04 08:16:27 +0000] [10] [INFO] Booting worker with pid: 10
[2022-09-04 08:16:27 +0000] [8] [ERROR] Exception in worker process
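The last frame of the traceback fails on `collections.MutableMapping`. The abstract base class aliases were removed from the top-level `collections` module in Python 3.10 and are available only from `collections.abc`, which is consistent with the `python3.10` paths in the log. A minimal sketch that checks this outside gunicorn and Falcon (the helper name `has_legacy_alias` is hypothetical):

```python
import collections
import collections.abc
import sys


def has_legacy_alias(name):
    """Return True if `collections.<name>` resolves on this interpreter."""
    return hasattr(collections, name)


if __name__ == "__main__":
    print("Python", sys.version.split()[0])
    # Expected to be False on Python 3.10 and later, where the alias is gone.
    print("collections.MutableMapping available:",
          has_legacy_alias("MutableMapping"))
    # The class itself still exists under collections.abc.
    print("collections.abc.MutableMapping:", collections.abc.MutableMapping)
```

If the first check prints `False` in the container, the installed Falcon release predates Python 3.10 support, and the interpreter in the image is newer than the one the service originally ran on.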

I would test pulling down that docker image and running it locally to see if the same thing happens. Also, did anything in your infrastructure change lately? For example if this app connects to a database, did the security group rules on the database change recently or anything? — Mark B, Sep 01 '22 at 14:03
Also, there should really be a lot more to the error message than what you posted. You could try some of the answers here: https://stackoverflow.com/questions/24488891/gunicorn-errors-haltserver-haltserver-worker-failed-to-boot-3-django — Mark B, Sep 01 '22 at 14:04
I have redeployed the same old code with a more verbose debugging level. I also checked the Docker image and it runs locally without any errors. There has been no change in this AWS profile for a long time. A few more errors I can share are: "Essential container in task exited" and "Exit Code 1" — Sam, Sep 03 '22 at 06:32
