A mouthful of a title, but the point is this: I have some data science pipelines (Python-based) with these requirements:
- are orchestrated with an "internal" orchestrator running on a server
- are run across a number of users/products/etc., where N could be relatively high
- I want to distribute the "load" of these jobs so it is not tethered to the orchestrator server
- these jobs are backed by a docker image
- these jobs are relatively fast to run (from 1 second to 20 seconds, post data load)
- these jobs most often require considerable I/O both coming in and out.
- no spark required - I want minimal hassle with scaling/provisioning/etc.
- data (in/out) would be stored either in an HDFS space in a cluster or in AWS S3
- the docker image would be relatively large (it encompasses a data science stack)
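For concreteness, here is a rough sketch of what one of these jobs does (the bucket name, key layout, and `run_model` function are made up for illustration; the real data may come from HDFS instead of S3):

```python
import sys

import boto3
import pandas as pd


def run_model(df: pd.DataFrame) -> pd.DataFrame:
    # placeholder for the actual 1-20 second computation
    return df.describe()


def main(user_id: str) -> None:
    s3 = boto3.client("s3")
    # hypothetical bucket/key layout
    s3.download_file("my-pipeline-bucket", f"inputs/{user_id}.parquet", "/tmp/in.parquet")
    result = run_model(pd.read_parquet("/tmp/in.parquet"))
    result.to_parquet("/tmp/out.parquet")
    s3.upload_file("/tmp/out.parquet", "my-pipeline-bucket", f"outputs/{user_id}.parquet")


if __name__ == "__main__":
    main(sys.argv[1])
```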
I was trying to understand the most (a) cost-efficient but also (b) fast solution to parallelize this. Candidates so far:
- AWS ECS
- AWS Lambda with Container Image Support
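In the Lambda case, the orchestrator would fan out roughly like this (the function name and payload shape are hypothetical; the same fan-out idea applies with `ecs.run_task` for ECS/Fargate):

```python
import json

import boto3

lambda_client = boto3.client("lambda")


def dispatch(user_ids):
    """Fire off one async Lambda invocation per user; the container image
    behind the function does the work and writes results back to S3."""
    for user_id in user_ids:
        lambda_client.invoke(
            FunctionName="ds-pipeline-job",   # hypothetical function name
            InvocationType="Event",           # async, so the orchestrator isn't blocked
            Payload=json.dumps({"user_id": user_id}),
        )


dispatch(["user-001", "user-002"])
```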
Please note that, for all intents and purposes, scaling/computing within the cluster is not feasible.
My issue is that I worry about the trade-offs: huge data transfers (in aggregate terms), the cost of pulling the docker image many times, time spent setting up containers on servers versus very little time spent doing actual work, and, in the Lambda case, the overhead of serverless management and debugging when things go wrong.
How are these kinds of cases generally handled?