Long story short, I have a cron job that uploads a bunch of files into Cloud Storage buckets daily at a specified time. Each of these buckets has an associated Pub/Sub notification topic that fires on file-creation events, and each event triggers a Dataflow job to process that file.
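For illustration, the trigger path is roughly equivalent to a Pub/Sub-triggered function that launches one templated Dataflow job per file. This is a simplified sketch, not my actual code; the project, template path, and function name are placeholders:

```python
from googleapiclient.discovery import build

PROJECT = "my-project"                        # placeholder
TEMPLATE = "gs://my-templates/process-file"   # placeholder template path


def on_file_created(event, context):
    """Pub/Sub-triggered function: launches one Dataflow job per new object."""
    attrs = event["attributes"]
    bucket, name = attrs["bucketId"], attrs["objectId"]

    dataflow = build("dataflow", "v1b3")
    dataflow.projects().templates().launch(
        projectId=PROJECT,
        gcsPath=TEMPLATE,
        body={
            "jobName": f"process-{name.lower().replace('/', '-')}"[:63],
            "parameters": {"inputFile": f"gs://{bucket}/{name}"},
        },
    ).execute()
```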
The problem is that this spins up hundreds of parallel batch-processing jobs within a few seconds. Each job slams my downstream services with HTTP requests; the services can't scale up fast enough and start throwing connection-refused errors.
To throttle these requests, I limited the number of workers available to each Dataflow job. I also increased the resources for my downstream services and reduced their targetCPUUtilizationPercentage to 50% to give them time to scale up. All of this has drastically reduced the error count, from roughly 40 million failed requests down to ~50,000.
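Concretely, the per-job throttle is just a worker cap set at launch time, along these lines (the values are illustrative, not my production settings):

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Cap autoscaling per job so a single file can't fan out into many workers.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",     # placeholder
    region="us-central1",     # placeholder
    num_workers=1,            # start small
    max_num_workers=2,        # hard ceiling on autoscaling for this job
)
```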
However, I was wondering: is there a way to queue these batch jobs? Having only 5 to 10 jobs running in parallel would give my downstream services much more breathing room.
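To make the ask concrete, this is roughly the gating behaviour I'm picturing in front of the launch step. The 10-job cap and the helper are hypothetical, and failing the function just to force a Pub/Sub redelivery is exactly the kind of crude workaround I'd like to avoid:

```python
from googleapiclient.discovery import build

PROJECT = "my-project"       # placeholder
REGION = "us-central1"       # placeholder
MAX_CONCURRENT_JOBS = 10     # hypothetical cap


def count_active_jobs():
    """Count Dataflow jobs currently active in the region (first page only)."""
    dataflow = build("dataflow", "v1b3")
    resp = dataflow.projects().locations().jobs().list(
        projectId=PROJECT, location=REGION, filter="ACTIVE"
    ).execute()
    return len(resp.get("jobs", []))


def maybe_launch(event, context):
    if count_active_jobs() >= MAX_CONCURRENT_JOBS:
        # Raising makes Pub/Sub redeliver the message later, which acts as a
        # crude queue, hence this question about doing it properly.
        raise RuntimeError("Too many active Dataflow jobs; retry later")
    # ...otherwise launch the templated job as in the first snippet...
```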
FlexRS is something I have yet to try, but I don't think it will help much, since its scheduling algorithm optimizes for either cost or speed, and neither is the issue here.
Note: all of my company's infrastructure is GCP-based, so feel free to suggest any other non-queue-based optimizations as well.