I have thousands of log files in a cloud storage bucket that I need to process and aggregate using an HTTP triggered cloud function and am looking for an approach to compute the task in the fastest possible way using parallelization.
At the moment, I have two cloud functions (nodejs 8):
The "main" function which a user is calling directly passing a list of log files that need to be processed; the function calls the "child" function for each provided log file that I also trigger with an HTTP request run parallel using async.each. The "child" function processes a single log file and returns the data to the "main" function which aggregates the results and, once all files are processed, sends the results back to the user.
If I call a child function directly, it takes about 1 second to complete a single file. I'd hope that if I call the main function to process 100 files in parallel the time will still be more or less 1 second. The first file in a batch is indeed returned after 1 second, but the time increases with every single file and the 100th file is returned after 7 seconds.
The most likely culprit is the fact that I'm running the child function using an HTTP request, but I haven't found a way to call them "internally". Is there another approach specific to Google Cloud Functions or maybe I can somehow optimise the parallelisation of HTTP requests?