To preface this, I know an HTTP request that takes longer than 1 minute is bad design and I'm going to look into Cloud Tasks, but I do want to figure out why this issue is happening.
As specified in the title, I have a simple API request to a Cloud Run service (fully managed) that takes longer than 1 minute: it does some DB operations, generates PDFs, and uploads them to GCS. When I make this request from the client (browser), it consistently gives me back a 502 response after 1 minute of waiting (presumably coming from the HTTP Load Balancer):
However, when I look at the logs, the request completes successfully (in about 4 to 5 minutes):
I'm also getting one of these "errors" for each PDF that's being generated and uploaded to GCS, but from what I've read, these shouldn't really be the issue:
To verify that it's not just some timeout issue with the application code or the browser, I put a 5-minute sleep on a random API call in a local build, and everything worked fine.
I have set the request timeout on Cloud Run to the maximum (15 minutes), left max concurrency at the default 80, set CPU and RAM to 2 and 2 GB respectively, and set the timeout on the Fastify (Node.js) server to 15 minutes as well. I also went through the logs and couldn't spot an out-of-memory error, or any other error, around the time I receive the 502. Finally, I followed the advice to use strace for a more in-depth look at the system calls, just in case something was going very wrong there, but from what I saw, everything looked fine.
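For completeness, this is roughly how the server-side timeout is configured (a sketch: `connectionTimeout` is Fastify's server-factory option in milliseconds, and the service name in the gcloud comment is a placeholder):

```javascript
// Sketch of the timeout settings described above (assumes a recent Fastify;
// connectionTimeout is applied to the underlying Node HTTP server).
const fastify = require('fastify')({
  connectionTimeout: 15 * 60 * 1000, // 15 min, matching the Cloud Run request timeout
});

// The Cloud Run side was set to its maximum with something like:
//   gcloud run services update my-service --timeout=900
```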
In the end, my suspicion is that there's some weird race condition in the routing between the container and the gateway/load balancer, but I know next to nothing about Knative (on which Cloud Run is built), so again, it's just a hunch.
If anyone has any more ideas on why this is happening, please let me know!