I am working on a project that grabs a set of input data from AWS S3, pre-processes and divvies it up, spins up 10k containers on AWS Batch to process the divvied data in parallel, aggregates the results, and pushes them back to S3.
I already have software patterns from other projects for Airflow + Batch, but have not dealt with the scaling factor of 10k parallel tasks. Airflow is nice since I can look at which tasks failed and retry a task after debugging. But dealing with that many tasks on one Airflow EC2 instance seems like a barrier. The other option would be to have one task that kicks off the 10k containers and monitors them from there (rough sketch below).
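For that second option, what I have in mind is a single fan-out task submitting one Batch array job, since Batch lets an array job spawn up to 10,000 child jobs and each child can find its shard via the AWS_BATCH_JOB_ARRAY_INDEX environment variable. Something like this sketch (queue and job definition names are placeholders for whatever exists in the account):

```python
import boto3

# One "fan-out" task: submit a single Batch array job that spawns up to
# 10,000 child jobs; each child picks its data shard using the
# AWS_BATCH_JOB_ARRAY_INDEX environment variable Batch injects for it.
batch = boto3.client("batch")

response = batch.submit_job(
    jobName="shard-processing",
    jobQueue="my-job-queue",              # placeholder queue name
    jobDefinition="my-shard-processor",   # placeholder job definition
    arrayProperties={"size": 10_000},     # one child job per data shard
    retryStrategy={"attempts": 3},        # Batch-level retries per child
)
parent_job_id = response["jobId"]

# The parent job's status summary reports how many children have
# succeeded/failed; failed children can be resubmitted after debugging.
status = batch.describe_jobs(jobs=[parent_job_id])["jobs"][0]
print(status["status"], status.get("arrayProperties", {}).get("statusSummary"))
```

My worry with this approach is that all the "which shard failed and why" visibility then lives in my own monitoring code rather than in the orchestrator.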
I have no experience with Step Functions, but have heard it's AWS's answer to Airflow. There seem to be plenty of patterns online for Step Functions + Batch. Does Step Functions seem like a good path to check out for my use case? Do you get the same insights on failing jobs / ability to retry tasks as you do with Airflow?
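For reference, the Step Functions + Batch patterns I've seen look roughly like the sketch below: a Task state calling Batch's submitJob with the .sync integration (so the state machine waits for the job to finish) plus a per-state Retry rule, which I assume is the rough equivalent of Airflow's task retries. I haven't run this, and the queue, job definition, and role ARN are placeholders:

```python
import json
import boto3

# Rough shape of the Step Functions + Batch pattern: a Task state that
# submits a Batch job via the .sync integration and retries on failure.
# Names and the IAM role ARN below are placeholders.
definition = {
    "StartAt": "ProcessShard",
    "States": {
        "ProcessShard": {
            "Type": "Task",
            "Resource": "arn:aws:states:::batch:submitJob.sync",
            "Parameters": {
                "JobName": "shard-processing",
                "JobQueue": "my-job-queue",            # placeholder
                "JobDefinition": "my-shard-processor", # placeholder
            },
            "Retry": [
                {
                    "ErrorEquals": ["States.ALL"],
                    "IntervalSeconds": 30,
                    "MaxAttempts": 3,
                    "BackoffRate": 2.0,
                }
            ],
            "End": True,
        }
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="batch-fan-out",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/StepFunctionsBatchRole",  # placeholder
)
```

The fan-out to 10k would presumably come from wrapping something like this in a Map state, but that's the part I'm least sure about, along with whether the console gives the same per-task failure visibility I get in Airflow.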