0

Long story short, I have a cron job that uploads a bunch of files into Cloud Storage buckets daily at a specified time. Each of these buckets has an associated Pub/Sub notification topic that fires on file creation events, and each event triggers a Dataflow job to process that file.
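
For context, the kind of notification I mean can be set up roughly like this (a sketch using the google-cloud-storage Python client; the project, bucket, and topic names are placeholders):

    # Sketch only: one way to wire a bucket notification that fires on file
    # creation and publishes to a Pub/Sub topic. Project, bucket, and topic
    # names are placeholders.
    from google.cloud import storage
    from google.cloud.storage.notification import (
        JSON_API_V1_PAYLOAD_FORMAT,
        OBJECT_FINALIZE_EVENT_TYPE,
    )

    client = storage.Client(project="my-project")   # placeholder project
    bucket = client.bucket("daily-upload-bucket")   # placeholder bucket

    notification = bucket.notification(
        topic_name="gcs-file-events",                # placeholder topic
        event_types=[OBJECT_FINALIZE_EVENT_TYPE],    # i.e. "file created"
        payload_format=JSON_API_V1_PAYLOAD_FORMAT,
    )
    notification.create()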

The problem is that this spins up hundreds of parallel batch-processing jobs within a few seconds. Each job slams my downstream services with HTTP requests; the services are unable to scale up fast enough and start throwing connection-refused errors.

To throttle these requests, I limited the number of workers available to each Dataflow job. I also increased the resources for my downstream services and reduced their targetCPUUtilizationPercentage to 50% to give them time to scale up. Together these changes have drastically reduced the error count, from roughly 40 million failed requests down to ~50,000.

However, I was wondering: is there a way to queue these batch jobs? Having only 5 to 10 jobs running in parallel would give my downstream services more breathing room.

FlexRS is something I have yet to try, but I don't think it will help much, since its algorithm optimizes for COST or SPEED, and neither is the issue here.

Note: all of my company's infrastructure is GCP-based. Feel free to make any other non-queue-based suggestions or optimizations.

Zain Qasmi
  • Does each job perform the same thing (same pipeline) but on different files? What are the sizes of the files? – guillaume blaquiere Jun 11 '21 at 13:57
  • Yes, the jobs are the same: largely reading the files (either JSON or CSV) line by line. File sizes range from 20 MB up to 4 GB. – Zain Qasmi Jun 11 '21 at 16:46
  • I think you can programmatically create pipelines and run them one by one. I assume you have a streaming job that reads Pub/Sub to start other Dataflow jobs. If that's the case, instead of using Dataflow to read from the Pub/Sub topic, write a program that subscribes to the topic and enqueues Dataflow pipelines, popping the next pipeline off the queue for execution when the current one has stopped. You can always check the state of a pipeline result. – ningk Jun 14 '21 at 20:03
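
As a rough illustration of ningk's suggestion, the sketch below assumes a pull subscription on the notification topic and a classic Dataflow template; every name in it is a placeholder. The controller pulls one message at a time and launches a new templated job only while fewer than MAX_CONCURRENT_JOBS jobs are active, so in the meantime notifications simply queue up in the subscription.

    # Sketch only: a tiny controller that caps how many Dataflow jobs run at once.
    # It pulls one Cloud Storage notification at a time from a pull subscription
    # and launches a templated job only while fewer than MAX_CONCURRENT_JOBS are
    # active. Project, region, template path, and subscription are placeholders.
    import time

    from google.cloud import pubsub_v1
    from googleapiclient.discovery import build

    PROJECT = "my-project"
    REGION = "us-central1"
    TEMPLATE_GCS_PATH = "gs://my-bucket/templates/process-file"  # placeholder
    SUBSCRIPTION = "file-events-pull-sub"                        # placeholder
    MAX_CONCURRENT_JOBS = 5

    dataflow = build("dataflow", "v1b3")
    subscriber = pubsub_v1.SubscriberClient()
    sub_path = subscriber.subscription_path(PROJECT, SUBSCRIPTION)


    def active_job_count():
        # Count Dataflow jobs currently active in the region.
        resp = (
            dataflow.projects().locations().jobs()
            .list(projectId=PROJECT, location=REGION, filter="ACTIVE")
            .execute()
        )
        return len(resp.get("jobs", []))


    def launch_job(bucket, obj):
        # Launch one templated Dataflow job for a single input file.
        body = {
            "jobName": "process-file-%d" % int(time.time()),
            "parameters": {"inputFile": "gs://%s/%s" % (bucket, obj)},
        }
        dataflow.projects().locations().templates().launch(
            projectId=PROJECT, location=REGION,
            gcsPath=TEMPLATE_GCS_PATH, body=body,
        ).execute()


    while True:
        if active_job_count() >= MAX_CONCURRENT_JOBS:
            time.sleep(30)      # concurrency budget used up, back off
            continue

        # Notifications keep queueing up in the subscription until we pull them.
        response = subscriber.pull(request={"subscription": sub_path, "max_messages": 1})
        if not response.received_messages:
            time.sleep(10)
            continue

        msg = response.received_messages[0]
        attrs = msg.message.attributes  # GCS notifications carry bucketId/objectId
        launch_job(attrs["bucketId"], attrs["objectId"])
        subscriber.acknowledge(request={"subscription": sub_path, "ack_ids": [msg.ack_id]})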

2 Answers

0

In a previous company, I had the same issue. We solved it by using streaming:

  • We started a single streaming Dataflow job that listens to the Pub/Sub messages Cloud Storage publishes when a file arrives
  • For each message, we downloaded the file and emitted one element into a PCollection per line of the file (so we didn't use the FileIO connectors, but standard file handling to read line by line)
  • Depending on the number of lines injected into the PCollection, that single Dataflow job scaled up and down (sometimes up to 100 n1-standard-16 workers!)

This could be a solution to your issue.
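
Roughly, the streaming job looks like the sketch below; the topic name and the per-line logic are placeholders, not tied to your exact setup. The download, line-by-line read, and cleanup all happen inside a single DoFn, so the file never leaves that worker.

    # Sketch only: one streaming pipeline instead of hundreds of batch jobs.
    # The topic name and process_line are placeholders.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions


    class FileToLines(beam.DoFn):
        def process(self, message):
            import os
            import tempfile

            from google.cloud import storage

            attrs = message.attributes  # bucketId / objectId set by the GCS notification
            blob = storage.Client().bucket(attrs["bucketId"]).blob(attrs["objectId"])

            # Download, read line by line, then delete: all in the same transform.
            with tempfile.NamedTemporaryFile(delete=False) as tmp:
                blob.download_to_file(tmp)
                local_path = tmp.name
            try:
                with open(local_path) as handle:
                    for line in handle:
                        yield line.rstrip("\n")
            finally:
                os.remove(local_path)


    def process_line(line):
        # Placeholder for the real per-line work (e.g. the downstream HTTP call).
        return line


    def run():
        options = PipelineOptions(streaming=True)
        with beam.Pipeline(options=options) as p:
            (
                p
                | "ReadNotifications" >> beam.io.ReadFromPubSub(
                    topic="projects/my-project/topics/gcs-file-events",  # placeholder
                    with_attributes=True,
                )
                | "FileToLines" >> beam.ParDo(FileToLines())
                | "ProcessLine" >> beam.Map(process_line)
            )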

guillaume blaquiere
  • We decided on batch processing because we expect 50 or even 100 GB files in the future. Downloading these to memory in Dataflow, even if possible, could be prohibitively expensive. What do you think? – Zain Qasmi Jun 11 '21 at 20:27
  • Download the file, read it line by line to create your PCollection, then delete the file; FileIO does exactly the same thing! It's important to perform these operations in the same transform, because if you split them across steps, each step could be executed on a different worker and you would lose the local file context. – guillaume blaquiere Jun 11 '21 at 20:31
0

You already have Pub/Sub in your architecture, so you can create another topic and use it like this:

... Pub/Sub Notification topic which triggers on the File Creation Event. Each event publishes a message to a job_queue topic. A template is used to read from that topic and trigger a Dataflow job to process the file; the details are in the message sent in the previous step.

How To Rate-Limit Google Cloud Pub/Sub Queue
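
For illustration, a sketch of the "publish a message to job_queue" step, assuming a small Pub/Sub-triggered Cloud Function does the forwarding (project and topic names are placeholders); a rate-limited consumer of job_queue then launches the Dataflow jobs.

    # Sketch only: a Pub/Sub-triggered Cloud Function (1st gen signature) that
    # forwards the file details from the bucket notification to a job_queue
    # topic. Project and topic names are placeholders.
    import json

    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path("my-project", "job_queue")  # placeholder project


    def forward_event(event, context):
        # 'event' is the Pub/Sub message from the Cloud Storage notification;
        # its attributes carry the bucket and object names.
        payload = json.dumps({
            "bucket": event["attributes"]["bucketId"],
            "object": event["attributes"]["objectId"],
        }).encode("utf-8")
        publisher.publish(topic_path, payload)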

rio