I have thousands of log files in a Cloud Storage bucket that I need to process and aggregate using an HTTP-triggered Cloud Function, and I'm looking for a way to parallelise the work so it completes as quickly as possible.

At the moment, I have two Cloud Functions (Node.js 8):

The "main" function which a user is calling directly passing a list of log files that need to be processed; the function calls the "child" function for each provided log file that I also trigger with an HTTP request run parallel using async.each. The "child" function processes a single log file and returns the data to the "main" function which aggregates the results and, once all files are processed, sends the results back to the user.

If I call the child function directly, it takes about 1 second to process a single file. I'd hoped that calling the main function to process 100 files in parallel would also take roughly 1 second. The first file in a batch does come back after about 1 second, but the latency grows with each subsequent file, and the 100th file only returns after 7 seconds.

The most likely culprit is that I'm invoking the child function via an HTTP request, but I haven't found a way to call it "internally". Is there another approach specific to Google Cloud Functions, or can I somehow optimise the parallelisation of the HTTP requests?

Ann Su

1 Answer

The easiest approach is to simply share the code that the child function runs and invoke it directly from the main function. In many cases this is simpler and costs less, since there are fewer function invocations.
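Roughly, and with the names being stand-ins for your own code, that could look like this:

```javascript
// Instead of an HTTP hop, factor the per-file work into a plain function
// and call it directly. processLogFile() is a stand-in name; its body is
// whatever your child function does today for a single file.
async function processLogFile(file) {
  // e.g. download the file from the bucket, parse it, return its stats
  return { file, lines: 0 }; // placeholder result
}

exports.main = async (req, res) => {
  const files = req.body.files;
  // All files are processed concurrently inside this one instance; you
  // trade horizontal scale-out for zero HTTP overhead per file.
  const results = await Promise.all(files.map(processLogFile));
  res.json(results);
};
```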

See also: Calling a Cloud Function from another Cloud Function

Doug Stevenson

  • But if I understand you correctly, I will hit the memory and CPU limits right away, as I will have just one instance running instead of thousands in parallel. – Ann Su May 05 '19 at 19:03
  • Do you need thousands running in parallel? What are your actual requirements here? Your problem description is not very specific about what you're *actually* hoping to accomplish. – Doug Stevenson May 05 '19 at 19:05
  • I'm hoping to process as many log files as possible per user request in the shortest possible time using a serverless stack on Google: Cloud Functions (instead of Compute Engine) and Storage (instead of SSD disks). Both CF and Storage are significantly slower, but I hope to achieve near-linear parallelization within CF limits and significantly outperform the current setup. – Ann Su May 05 '19 at 19:32
  • If you don't have specific scaling requirements, could you try my suggestion? – Doug Stevenson May 05 '19 at 19:34
  • Also, I hope it's clear that you can't optimize for both speed and scalability here, using Cloud Functions. You will have to pick one, and the one you choose is going to depend on your anticipated load. You may also want to give some thought to whether or not your "sub" function can happen asynchronously, or if it needs to complete before the main function can complete. – Doug Stevenson May 05 '19 at 19:38
  • Let me give you a benchmark. Using Compute Engine and an SSD disk, on a single thread, I can process 5000 log files in about 180 seconds. That is my minimum target. If I take the same, non-parallelized approach using Cloud Functions and Storage, it will take at minimum 5000 seconds - that's about 30x slower. The Cloud Functions quota (source: https://cloud.google.com/functions/quotas) allows a maximum of 1000 concurrent invocations, so if I find a way to fully use that quota I will get results much faster (see the batching sketch after this thread). – Ann Su May 05 '19 at 19:58
  • Cloud Functions isn't intended for heavy or lengthy compute loads. I would stick with Compute Engine. I think you will just have problems wedging Cloud Functions into this sort of workflow. – Doug Stevenson May 05 '19 at 20:12
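As a middle ground between the two positions above, the main function could batch files so that each child invocation processes several of them, amortising the HTTP overhead across files while still fanning out. A rough sketch, where the batch size, child URL, and node-fetch client are all assumptions for illustration:

```javascript
// Hypothetical batching variant of the "main" function (illustrative only).
const async = require('async');
const fetch = require('node-fetch');

const CHILD_URL = 'https://REGION-PROJECT.cloudfunctions.net/childBatch'; // placeholder
const BATCH_SIZE = 10; // tune so concurrent invocations stay within quota

// Split a list into fixed-size chunks.
function chunk(arr, size) {
  const out = [];
  for (let i = 0; i < arr.length; i += size) out.push(arr.slice(i, i + size));
  return out;
}

exports.main = (req, res) => {
  const batches = chunk(req.body.files, BATCH_SIZE);

  // One HTTP request per batch instead of one per file.
  async.map(batches, (batch, done) => {
    fetch(CHILD_URL, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ files: batch }), // the child loops over its batch
    })
      .then(r => r.json())
      .then(data => done(null, data))
      .catch(done);
  }, (err, perBatch) => {
    if (err) return res.status(500).send(err.toString());
    // Flatten the per-batch results into one response.
    res.json([].concat(...perBatch));
  });
};
```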