
I have an MWAA Airflow environment in my AWS account. The DAG I am setting up reads a large amount of data from S3 bucket A, filters out what I need, and writes the filtered results to S3 bucket B. It needs to run every minute since the data arrives every minute. Each run processes about 200 MB of JSON data.
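Roughly, the task looks like the sketch below (just a minimal, simplified version assuming Airflow 2 style imports; the bucket names, key layout, and filter condition are placeholders, not my actual job):

```python
# Minimal sketch of the DAG: read a minute's JSON batch from bucket A,
# filter it, and write the result to bucket B. Bucket names, key layout,
# and the filter condition are placeholders.
import json
from datetime import datetime

import boto3
from airflow import DAG
from airflow.operators.python import PythonOperator

SOURCE_BUCKET = "bucket-a"   # placeholder
DEST_BUCKET = "bucket-b"     # placeholder


def filter_minute_batch(**context):
    s3 = boto3.client("s3")
    # Assume one JSON file per minute, keyed by the run's logical timestamp.
    key = context["ts_nodash"] + ".json"
    records = json.loads(
        s3.get_object(Bucket=SOURCE_BUCKET, Key=key)["Body"].read()
    )
    # Placeholder filter condition.
    filtered = [r for r in records if r.get("keep")]
    s3.put_object(
        Bucket=DEST_BUCKET,
        Key="filtered/" + key,
        Body=json.dumps(filtered).encode("utf-8"),
    )


with DAG(
    dag_id="s3_filter_every_minute",
    start_date=datetime(2021, 9, 1),
    schedule_interval="* * * * *",  # run every minute
    catchup=False,
) as dag:
    PythonOperator(
        task_id="filter_and_copy",
        python_callable=filter_minute_batch,
    )
```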

My initial setup used environment class mw1.small with 10 worker machines. If I run the task only once in this setting, each run takes about 8 minutes to finish. But once I start the every-minute schedule, most runs cannot finish: they take much longer (around 18 minutes) and fail with this error message:

[2021-09-25 20:33:16,472] {{local_task_job.py:102}} INFO - Task exited with return code Negsignal.SIGKILL

I tried upgrading the environment class to mw1.large with 15 workers. More jobs were able to complete before the error showed up, but it still could not keep up with data arriving every minute. The Negsignal.SIGKILL error would still appear before the worker count even reached its maximum.

At this point, what should I do to scale this? I could imagine spinning up another Airflow environment, but that does not really make sense. There must be a way to do it within one environment.

  • SIGKILL is an indication of out of memory. Not sure if that info is helpful or not. – jordanm Sep 27 '21 at 17:30
  • @jordanm Ah that makes sense, but in this case how could I increase memory? I'm already using mw1.large which is the biggest machine MWAA provides. – Kei Sep 27 '21 at 18:02
  • Yeah, you are at the limit for MWAA. You can try breaking your job into smaller chunks; otherwise, you can get more RAM with Airflow by self-managing it instead of using MWAA. AWS Glue might also be a fit for your use case. – jordanm Sep 27 '21 at 18:07
  • It is also probably worth checking `local_task_job.py` itself to see if there is any way you can reduce memory use. (200 MB doesn't seem too bad, but depending on what you are doing inside, it could get very large.) What framework are you using within this script? – Emma Sep 28 '21 at 16:11
  • Hello @Emma, I am just running Python and using Boto3 for the S3 buckets stuff, I just found the solution though. Thank you! – Kei Sep 29 '21 at 21:45

1 Answer


I've found the solution to this. For MWAA, edit the environment and, under Airflow configuration options, set these configs:

  1. `celery.sync_parallelism = 1`
  2. `celery.worker_autoscale = 1,1`

This makes sure each worker machine runs one task at a time, preventing multiple tasks from sharing the worker's memory, which saves memory and reduces runtime.
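If you'd rather apply these options from code instead of the console, the same settings can be pushed with boto3's MWAA client. This is only a sketch: the environment name below is a placeholder, and you should double-check the call against the MWAA docs for your setup.

```python
# Sketch: apply the two Celery options to an existing MWAA environment.
# "my-mwaa-env" is a placeholder; use your environment's name.
import boto3

mwaa = boto3.client("mwaa")
mwaa.update_environment(
    Name="my-mwaa-env",
    AirflowConfigurationOptions={
        "celery.sync_parallelism": "1",
        "celery.worker_autoscale": "1,1",
    },
)
```

MWAA rolls these out as an environment update, so expect the change to take a while to apply.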
