Cloud Run/Gunicorn giving 502 error after one minute

Question

I'm deploying a Python application to Google Cloud Run that uses Gunicorn. Both my Gunicorn timeout and my Cloud Run request timeout are set to 900 seconds. Strangely, when I call the function, I get a 502 error from Cloud Run if the application runs for more than 60 seconds, but not if it runs for less than 60 seconds. For example, the deployed function below threw this error:

def process_file(request=request):
    time.sleep(61)
    ...
    return handle_response()

However, if I changed the sleep to 40 seconds:

def process_file(request=request):
    time.sleep(40)
    ...
    return handle_response()

there was no 502 error. At first I thought the issue was caused by nginx, which has a 60-second default timeout, but nginx does not seem to be deployed with Docker or Cloud Run by default, so it is unlikely to be the cause. My Dockerfile is below:

FROM continuumio/miniconda3

# Install production dependencies.
RUN conda install numpy==1.17.2
RUN conda install xlsxwriter==1.1.2
RUN conda install pandas==0.25.1
RUN conda install -c conda-forge ciso8601
RUN pip install gunicorn flask gevent flask_mail flask-cors pyjwt firebase_admin networkx datefinder google-cloud-pubsub 

# Copy local code to the container image.
COPY app.py .
RUN mkdir backend/
COPY backend/ /backend/

# Service must listen to $PORT environment variable.
# This default value facilitates local development.
ENV PORT 8080

# Run the web service on container startup. Here we use the gunicorn
# webserver, with one worker process and 8 threads.
# For environments with multiple CPU cores, increase the number of workers
# to be equal to the cores available.
CMD exec gunicorn --bind 0.0.0.0:$PORT --workers 1 app:app --timeout 900 --log-level debug

I am calling the Cloud Run service using axios in the frontend, which from my understanding does not have a timeout by default, so I do not believe that this should be an issue. Any help is appreciated, thanks!

EDIT: Here is an image of the error message in the chrome console - does not seem to be very helpful though:

The default Cloud Run timeout is 5 minutes. Run Gunicorn with `--log-level debug` to see whether Gunicorn is the problem or something else. Cloud Run does not use Nginx as a frontend. The GFE (Google Front End) sits in front of Cloud Run. Cloud Run supports the `--timeout=N` flag. Try setting that to a specific value like 120. Review the Stackdriver logs for your Cloud Run service instance. Edit your question with your findings. — John Hanley, Nov 22 '20 at 06:04
I set the timeout to 15 minutes (900 seconds) so that can't be the problem; I'll update the question with that. As you can see from the dockerfile I've already added debug and the timeout flag - nothing shows in the logs that is useful. — Alex, Nov 22 '20 at 07:03
Nope, the 502 is only showing in the frontend. There is no trace of it in the stackdriver. I should probably note that the cloud run function does in fact complete execution successfully. — Alex, Nov 22 '20 at 07:18
That means Cloud Run is probably not the culprit by itself. Look at the browser debugger. Does anything show up that might point to the issue? Are you using a load balancer in front of Cloud Run? HTTP Error 502 means bad gateway. In my experience this means a failure between the front end and the back end (the backend fails to respond in time). Edit your question with more details on your architecture. — John Hanley, Nov 22 '20 at 08:27
I added a screenshot of the error message from the chrome console, but it doesn't seem very useful. — Alex, Nov 22 '20 at 16:28

score 0 · Answer 1 · answered Nov 22 '20 at 11:37

We have encountered a similar issue. The GCP internal load balancer in front of your Cloud Run service probably can't pass the request to the instance. This means that some process made the Cloud Run instance stall after 60 seconds, so that it does not receive any request. According to this post, it might have something to do with Cloud Run interfering with the Gunicorn workers. Since Cloud Run (managed) is a serverless environment, the order in which workers and code are loaded and shut down matters. You could try setting `--preload` and `--timeout=0`. Another article suggests a similar thing.
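As a rough sketch, the two flags suggested above could also be set in a Gunicorn config file instead of on the command line. The file name `gunicorn.conf.py` and the choice to move the existing Dockerfile options into it are assumptions for illustration; the setting names (`bind`, `workers`, `preload_app`, `timeout`, `loglevel`) are standard Gunicorn settings.

```python
# gunicorn.conf.py (hypothetical file name; pass it with `-c gunicorn.conf.py`)
import os

# Cloud Run injects the port to listen on through $PORT; 8080 is the local default
# used in the question's Dockerfile.
bind = "0.0.0.0:" + os.environ.get("PORT", "8080")

# Same worker count as the original CMD line.
workers = 1

# Equivalent of --preload: load the application before forking the workers.
preload_app = True

# Equivalent of --timeout=0: disable Gunicorn's worker timeout entirely and
# leave request timeouts to Cloud Run.
timeout = 0

# Same log level as the original CMD line.
loglevel = "debug"
```

With such a file, the Dockerfile command would become something like `CMD exec gunicorn -c gunicorn.conf.py app:app`. This only changes how Gunicorn is configured; it does not affect any timeout applied by a proxy in front of Cloud Run.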

Tried adding all of those flags, still having the issue :( — Alex, Nov 22 '20 at 16:28

score 0 · Accepted Answer · answered Dec 08 '20 at 06:57

Figured out the issue. I was sending HTTP POST requests to a Firebase-hosted domain. POST requests to a Firebase-hosted domain time out after 60 seconds (see "Firebase-Hosted Cloud Function retrying on any request that takes 60s, even when timeout is >60s"). The solution was to call the Cloud Run URL directly instead.
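To illustrate the fix, here is a minimal Python sketch of a client that targets the Cloud Run URL directly rather than the Firebase-hosted domain. Both URLs, the `/process_file` route, and the helper names are hypothetical placeholders, not values from my project; my actual frontend uses axios, so this only shows the idea.

```python
import json
import urllib.request

# Hypothetical URLs: replace with your own service and hosting domains.
CLOUD_RUN_URL = "https://my-service-abc123-uc.a.run.app"
FIREBASE_HOSTED_URL = "https://my-project.web.app"  # requests through here were cut off at 60 s


def build_process_request(payload, base_url=CLOUD_RUN_URL):
    """Build a JSON POST request aimed at the Cloud Run service itself."""
    data = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=base_url + "/process_file",  # hypothetical route name
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call_process_file(payload, timeout_seconds=900):
    """Send the request, waiting up to the same 900 s configured on the server side."""
    request = build_process_request(payload)
    with urllib.request.urlopen(request, timeout=timeout_seconds) as response:
        return response.status, response.read()
```

The only change that mattered in my case was the base URL; the request itself stayed the same.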
