
Mouthful of a title, but the point is this: I have some Python-based data science pipelines with the following requirements:

  1. are orchestrated by an "internal" orchestrator running on a server
  2. are run across a number of users/products/etc., where N could be relatively high
  3. I want to distribute the "load" of these jobs rather than be tethered to the orchestrator server
  4. these jobs are backed by a Docker image
  5. these jobs are relatively fast to run (from 1 second to 20 seconds, post data load)
  6. these jobs most often involve considerable I/O, both in and out
  7. no Spark required
  8. I want minimal hassle with scaling/provisioning/etc.
  9. data (in/out) would be stored in either an HDFS space in a cluster or AWS S3
  10. the Docker image would be relatively large (it encompasses a data science stack)

I am trying to understand the most (a) cost-efficient but also (b) fast solution for parallelizing this. Candidates so far:

  1. AWS ECS
  2. AWS Lambda with Container Image Support (rough sketch of this model below)
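
To make option 2 concrete, a per-job handler baked into the Lambda container image might look roughly like this. This is only a sketch: the bucket/key fields and `run_pipeline` stand in for my actual pipeline entry point.

```python
# Hypothetical per-job handler for a Lambda container image (option 2).
# The heavy data science stack ships in the image; bucket/key names
# and run_pipeline are placeholders for the real pipeline package.
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # One invocation == one job: fetch input, compute, write output.
    bucket = event["bucket"]
    s3.download_file(bucket, event["input_key"], "/tmp/input.parquet")

    from my_pipeline import run_pipeline  # placeholder import
    result_path = run_pipeline("/tmp/input.parquet")  # the 1-20 second compute step

    s3.upload_file(result_path, bucket, event["output_key"])
    return {"status": "ok", "output_key": event["output_key"]}
```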

Please note that, for all intents and purposes, scaling/computing within the cluster is not feasible.

My issue is that I worry about the tradeoffs: huge data transfers (in aggregate terms), high costs from invoking the Docker-backed jobs many times, time spent setting up containers on servers versus very little time spent doing actual work, and, in the Lambda case, serverless management and debugging when things go wrong.

How are these kinds of cases generally handled?

Asher11
  • Hey Asher I'm gonna comment on mreferre's answer below because he's saying a lot of good things, so I might as well add to that rather than start a competing answer. – Mrk Fldig Sep 22 '21 at 17:30

1 Answer


This is a very good question. First and foremost, I would assume you are comparing Lambda to ECS/Fargate (more here for background re Fargate). While many considerations hold true for ECS/EC2, ECS/Fargate is a closer model to Lambda.

Having said that, Fargate and Lambda are different enough that it's hard to make an apples-to-apples comparison between the two without taking into account their different programming and execution models (event-driven vs service-based). This isn't to say that you can't invoke batch jobs to run on Fargate the way you'd invoke a Lambda, but 1) with this relatively short execution time (1-20 seconds) and 2) at the scale you are alluding to ... invoking a Fargate task on demand per execution unit may be too penalizing (e.g. because of the limited granularity of task sizes and because task start times are in the range of 30-60 seconds, compared to Lambda's milliseconds). A better comparison in this case would be a Lambda invocation per job vs a number of running (and horizontally scalable) ECS/Fargate tasks that can each support multiple jobs.
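
To make the two invocation models concrete, here is a minimal sketch of the "one Lambda invocation per job" fan-out (the function name is made up for this example):

```python
# Minimal sketch of the Lambda-per-job model. FunctionName is a
# placeholder; InvocationType='Event' makes the calls asynchronous,
# so the orchestrator isn't tethered to each job's 1-20s runtime.
import json
import boto3

lam = boto3.client("lambda")

def fan_out(jobs):
    for job in jobs:
        lam.invoke(
            FunctionName="my-ds-job",   # placeholder function name
            InvocationType="Event",     # async fire-and-forget
            Payload=json.dumps(job).encode(),
        )
```

The Fargate 1:many equivalent would be a pool of long-running tasks pulling jobs from a queue; a sketch of that follows the next paragraph.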

Something that you are not mentioning in your analysis is whether these jobs would be created from scratch or already exist and would need to be adapted to one or more of these different models (Lambda 1:1, Fargate 1:1, Fargate 1:many). Some customers may decide to stick to a specific model because they can't afford to tweak the existing code base.
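
For completeness, the Fargate 1:many model usually looks like a long-running worker loop fed by a queue. A sketch, assuming SQS as the queue (the queue URL and process_job are placeholders, not part of any real setup):

```python
# Sketch of a Fargate 1:many worker: one long-running task processing
# many short jobs pulled from SQS.
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/jobs"  # placeholder

def worker_loop(process_job):
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=10,  # amortize polling across several short jobs
            WaitTimeSeconds=20,      # long polling to cut down empty receives
        )
        for msg in resp.get("Messages", []):
            process_job(msg["Body"])  # the 1-20 second unit of work
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```

Scaling then means adjusting the service's desired task count (or autoscaling on queue depth) rather than launching one task per job.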

In general I would say that, if the software needs to be created from scratch, the Lambda model with its hands-off approach seems to be a slightly better fit for this use case.

But in terms of which will be cheaper, it's a hard call to make in theory.

mreferre
  • Hey mreferre, I was gonna answer but you're on the right track so I'll drop comments here - based on the info they could just create a custom Lambda container - BUT Lambda fans out massively when you stick this kind of work on it, so where does the data go? If it's a DB, concurrent connections could freak out. These are thoughts for the conversation rather than answers etc – Mrk Fldig Sep 22 '21 at 17:27
  • With EC2-backed stuff (Fargate etc.) you could fire up instances with a shared EBS volume, thus addressing the high-I/O problem. – Mrk Fldig Sep 22 '21 at 17:29
  • thank you for the detailed answer. Basically this would be a deployment option for a data science pipeline package I have developed, already use extensively, and want to take to the next level, and these requirements cover basically 99% of my use cases. Would the shared EBS volume "hold" with high concurrency? They are fast, but like-a-1000-concurrent-accesses fast? Cost of course is essential, as Lambda functions potentially "spoil" you in terms of efficiency and your budget gets butchered fast – Asher11 Sep 22 '21 at 18:26
  • little update: also, I tend to serialize everything, so no DB stress for now. I see the shared EBS volume has limitations on the number of EC2 instances one can attach. This is a headache, as I would need to set up a sort of load balancer on top of it, or use multiple EBS volumes and redirect traffic as needed. I would definitely avoid that, even though I see the appeal for other applications – Asher11 Sep 22 '21 at 19:01
  • I love this thread. For completeness, neither Lambda nor Fargate support EBS volumes (and if/when they do it's not going to be "shared" because of... block storage). If you want to use shared storage your only option is EFS (which may or may not work based on "high concurrency" requirements). More info [here for Fargate](https://aws.amazon.com/blogs/containers/developers-guide-to-using-amazon-efs-with-amazon-ecs-and-aws-fargate-part-1/) and [here for lambda](https://aws.amazon.com/blogs/compute/using-amazon-efs-for-aws-lambda-in-your-serverless-applications/) – mreferre Sep 23 '21 at 06:34