
The issue appeared recently: a previously healthy container now enters a sleep loop whenever a shutit session is created. The issue occurs only on Cloud Run, not locally.

Minimal reproducible example:

requirements.txt

Flask==2.0.1
gunicorn==20.1.0
shutit

Dockerfile

FROM python:3.9

# Allow statements and log messages to immediately appear in the Cloud Run logs
ENV PYTHONUNBUFFERED True

COPY requirements.txt ./
RUN pip install -r requirements.txt

# Copy local code to the container image.
ENV APP_HOME /myapp
WORKDIR $APP_HOME
COPY . ./

CMD exec gunicorn \
 --bind :$PORT \
 --worker-class "sync" \
 --workers 1 \
 --threads 1 \
 --timeout 0 \
 main:app

main.py

import os
import shutit
from flask import Flask, request

app = Flask(__name__)

# just to prove the API works
@app.route('/ping', methods=['GET'])
def ping():
    os.system('echo pong')
    return 'OK'

# issue replication
@app.route('/healthcheck', methods=['GET'])
def healthcheck():
    os.system("echo 'healthcheck'")
    # hangs inside create_session
    shell = shutit.create_session(echo=True, loglevel='debug')
    # shell.send is never reached
    shell.send('echo Hello World', echo=True)
    # the handler never returns
    return 'OK'

if __name__ == '__main__':
    app.run(host='127.0.0.1', port=8080, debug=True)

cloudbuild.yaml

steps:
  - id: "build_container"
    name: "gcr.io/kaniko-project/executor:latest"
    args:
      - --destination=gcr.io/$PROJECT_ID/borked-service-debug:latest
      - --cache=true
      - --cache-ttl=99h
  - id: "configure infrastructure"
    name: "gcr.io/cloud-builders/gcloud"
    entrypoint: "bash"
    args:
      - "-c"
      - |
        set -euxo pipefail

        REGION="europe-west1"
        CLOUD_RUN_SERVICE="borked-service-debug"

        SA_NAME="$${CLOUD_RUN_SERVICE}@${PROJECT_ID}.iam.gserviceaccount.com"

        gcloud beta run deploy $${CLOUD_RUN_SERVICE} \
          --service-account "$${SA_NAME}" \
          --image gcr.io/${PROJECT_ID}/$${CLOUD_RUN_SERVICE}:latest \
          --allow-unauthenticated \
          --platform managed \
          --concurrency 1 \
          --max-instances 10 \
          --timeout 1000s \
          --cpu 1 \
          --memory=1Gi \
          --region "$${REGION}"

Cloud Run logs that loop:

Setting up prompt
In session: host_child, trying to send: export PS1_ORIGIN_ENV=$PS1 && PS1='OR''IGIN_ENV:rkkfQQ2y# ' && PROMPT_COMMAND='sleep .05||sleep 1'
================================================================================
Sending>>> export PS1_ORIGIN_ENV=$PS1 && PS1='OR''IGIN_ENV:rkkfQQ2y# ' && PROMPT_COMMAND='sleep .05||sleep 1'<<<, expecting>>>['\r\nORIGIN_ENV:rkkfQQ2y# ']<<<
Sending in pexpect session (68242035994000): export PS1_ORIGIN_ENV=$PS1 && PS1='OR''IGIN_ENV:rkkfQQ2y# ' && PROMPT_COMMAND='sleep .05||sleep 1'
Expecting: ['\r\nORIGIN_ENV:rkkfQQ2y# ']
export PS1_ORIGIN_ENV=$PS1 && PS1='OR''IGIN_ENV:rkkfQQ2y# ' && PROMPT_COMMAND='sleep .05||sleep 1'
root@localhost:/myapp# export PS1_ORIGIN_ENV=$PS1 && PS1='OR''IGIN_ENV:rkkfQQ2y# ' && PROMPT_COMMAND='sleep .05||sleep 1'
Stopped sleep .05
Stopped sleep 1
pexpect: buffer: b'' before: b'cm9vdEBsb2NhbGhvc3Q6L3B1YnN1YiMgIGV4cx' after: b'DQpPUklHSU5fRU5WOnJra2ZRUTJ5IyA='
Resetting default expect to: ORIGIN_ENV:rkkfQQ2y# 
In session: host_child, trying to send: stty cols 65535
================================================================================
Sending>>> stty cols 65535<<<, expecting>>>ORIGIN_ENV:rkkfQQ2y# <<<
Sending in pexpect session (68242035994000): stty cols 65535
Expecting: ORIGIN_ENV:rkkfQQ2y# 
ORIGIN_ENV:rkkfQQ2y# stty cols 65535
stty cols 65535
Stopped stty cols 65535
Stopped sleep .05
Stopped sleep 1

Workarounds tried:

  • Different regions: several European (tier 1 and 2), Asia, US.
  • Build with docker instead of kaniko
  • Different CPU and Memory allocated to the container
  • Minimum number of containers 1-5 (to ensure CPU is always allocated to the container)
  • --no-cpu-throttling also made no difference
  • Maximum number of containers 1-30
  • Different GCP project
  • Different Docker base images (3.5-3.9 + various shas ranging from a year ago to recent ones)
alanmynah
  • Cloud Run does not support background tasks. When your Flask app returns the HTTP response, Cloud Run will idle the CPU. Your background tasks will then not have CPU time. – John Hanley Sep 21 '21 at 09:00
  • Is this a new restriction? Because this has been working perfectly fine until last Thursday. – alanmynah Sep 21 '21 at 09:02
  • 1
    No, this is not a new restriction and has been documented since the first release. You have just been lucky. https://cloud.google.com/run/docs/tips/general – John Hanley Sep 21 '21 at 09:04
  • Not sure I follow. It's not really used as a background task, because the http response doesn't get returned until the shutit work is done. So CPU should still be allocated. And I can see in the Cloud Run dashboard that CPU is allocated to containers. This hangs: `shell = shutit.create_session(echo=True, loglevel='debug')`; this never executes: `shell.send('echo Hello World', echo=True)`; never returns: `return 'OK'` – alanmynah Sep 21 '21 at 09:06
  • Did you read the documentation link I sent? Your application is packed in a container. The CPU is allocated to the thread that is running when you receive the HTTP Request. The execution model is HTTP Request/Response. **Shutit** is a wrapper for **Pexpect** which is a Python module for spawning child applications. Child applications run asynchronously to the Cloud Run thread. – John Hanley Sep 21 '21 at 09:10
  • Yes, and `when the Cloud Run service finishes handling a request, the container instance's access to CPU will be disabled or severely limited.` Hence if the request isn't finished the CPU is still there. Additionally, in my original question, in the "workarounds tried" was minimum containers allocated 1-5. To ensure that CPU is always there to circumvent that possibility. According to this: https://cloud.google.com/run/docs/configuring/cpu-allocation – alanmynah Sep 21 '21 at 09:18
  • updated the original question to clarify that CPU is always there. – alanmynah Sep 21 '21 at 09:27
  • Go to the logs. If the **shell.send()** never returns, then your Cloud Run thread should hang. Cloud Run will kill the container and you will see an error log entry. Instead of debating this issue, collect data that provides details on what is actually happening in your application. – John Hanley Sep 21 '21 at 09:32
  • It never gets to `shell.send()`. It hangs on `shell = shutit.create_session(echo=True, loglevel='debug')`. The container gets killed after too many `sleep` messages. The output from the logs is also in the original message. Instead of walking through debug again, let's read the message in full, and then write a reply. I've made sure it's a quality question, let's ensure they are quality answers – alanmynah Sep 21 '21 at 09:43
  • You have already answered your own question. If your app calls sleep too many times, Cloud Run kills the container. Don't call sleep. Cloud Run is not an asynchronous runtime system. – John Hanley Sep 21 '21 at 17:27
  • sleep is called by shutit... – alanmynah Sep 23 '21 at 09:12
  • Were you previously using Cloud Run with shutit to create sessions when you said the containers were healthy and worked fine? – Priyashree Bhadra Sep 28 '21 at 14:03
  • That's right. And then all of a sudden it stopped working. Can't even link to any particular release in the notes – alanmynah Sep 29 '21 at 11:12
  • I can think of only this [latest release](https://cloud.google.com/run/docs/release-notes#September_13_2021) in Cloud Run but you also mentioned you had tried --no-cpu-throttling for constant CPU allocation. Can you check if there were recent changes/updates on the shutit library you are using? Also have you upgraded to Python3 because python 2.7 is on its sunset and the [Gunicorn documentation](https://docs.gunicorn.org/en/stable/news.html#changelog) says the minimum version is python 3.5. Note : I am just suggesting a few possibilities. – Priyashree Bhadra Sep 29 '21 at 13:20
  • Yeah, no worries at all, appreciate your reply! I've tried various versions of python3 (3.5 through to 3.9), latest gunicorn 1 and 2, no recent changes to shutit, the latest release, according to pypi was Jan 11, 2020. – alanmynah Sep 29 '21 at 14:12
  • Can you please try with "shutit.create_session('bash')"? – Gourav B Sep 29 '21 at 18:50
  • same effect and bash is also a default argument. https://github.com/ianmiell/shutit/blob/master/shutit.py#L38 https://github.com/ianmiell/shutit/blob/master/shutit_global.py#L177 – alanmynah Sep 30 '21 at 14:00
  • There is [known issue](https://github.com/ianmiell/shutit#known-issues), can you please try with setting a simple prompt? – Gourav B Oct 06 '21 at 17:03
  • how would you go about that? Not sure how to do that – alanmynah Oct 19 '21 at 10:59

2 Answers


I have reproduced your issue and we have discussed several possibilities. I think the issue is your Cloud Run service not being able to process requests and hence preparing to shut down (SIGTERM). I am listing some possibilities for you to look at and analyse.

  • A good reason for your Cloud Run service failing to start is that the server process inside the container is configured to listen on the localhost (127.0.0.1) address. This refers to the loopback network interface, which is not accessible from outside the container, so the Cloud Run health check cannot be performed and the service deployment fails. To solve this, configure your application to start the HTTP server listening on all network interfaces, commonly denoted as 0.0.0.0.

  • While searching for the Cloud Logging error you are getting, I came across this answer and GitHub link from the shutit library developer, which points to a technique for tracking inputs and outputs in complex container builds in shutit sessions. One good finding from the GitHub link: I think you will have to pass the session_type in shutit.create_session('bash') or shutit.create_session('docker'), which you are not specifying in the main.py file. That could be the reason your shutit session is failing.

  • This issue could also be due to some Linux kernel feature used by the shutit library which is not currently supported properly in gVisor. I am not sure how it worked for you the first time. Most apps will work fine, or at least as well as in regular Docker, but may not get 100% compatibility.

    Cloud Run applications run on the gVisor container sandbox (which currently supports Linux only), which executes Linux kernel system calls made by your application in userspace. gVisor does not implement all system calls (see here). From this GitHub link: "If your app has such a system call (quite rare), it will not work on Cloud Run. Such an event is logged and you can use strace to determine when the system call was made in your app."

    If you're running your code on Linux, install strace (sudo apt-get install strace) and run your application under it by prefacing your usual invocation with strace -f, where -f means trace all child threads. For example, if you normally invoke your application with ./main, you can run it with /usr/bin/strace -f ./main.

    From this documentation: if you feel your issue is caused by a limitation in the container sandbox, you can look in the Cloud Logging section of the GCP Console (not in the "Logs" tab of the Cloud Run section) for Container Sandbox entries with a DEBUG severity in the varlog/system logs, or use the Log Query:

resource.type="cloud_run_revision"
logName="projects/PROJECT_ID/logs/run.googleapis.com%2Fvarlog%2Fsystem"

For example: Container Sandbox: Unsupported syscall setsockopt(0x3,0x1,0x6,0xc0000753d0,0x4,0x0)
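
As a minimal sketch of the listening-address fix from the first bullet: bind to 0.0.0.0 and read the port from the PORT environment variable that Cloud Run injects. The stdlib http.server stands in for Flask/gunicorn here, so everything beyond the bind address and PORT handling is illustrative only:

```python
import os
from http.server import HTTPServer, BaseHTTPRequestHandler

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b'OK')

def make_server(port=None):
    # Cloud Run injects the serving port via the PORT env var; default to 8080.
    if port is None:
        port = int(os.environ.get('PORT', '8080'))
    # Bind to all interfaces (0.0.0.0), not loopback-only 127.0.0.1,
    # so the platform's health check can reach the container.
    return HTTPServer(('0.0.0.0', port), Handler)

# To serve: make_server().serve_forever()
```

The same idea applies to the question's setup: gunicorn's `--bind :$PORT` already listens on all interfaces, but the dev-mode `app.run(host='127.0.0.1', ...)` in main.py would not.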

By default, container instances have min-instances turned off, with a setting of 0. We can change this default using the Cloud Console, the gcloud command line, or a YAML file, by specifying a minimum number of container instances to be kept warm and ready to serve requests.
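
For instance, using the service name and region from the question (the flag values here are illustrative, not a recommendation):

```shell
# Keep one instance warm so CPU is always allocated to a running container
gcloud run services update borked-service-debug \
  --region europe-west1 \
  --platform managed \
  --min-instances 1
```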

You can also have a look at this documentation and GitHub link, which talk about Cloud Run container runtime behaviour and troubleshooting, for reference.

Priyashree Bhadra
  • Hi Priyashree, thank you soooo much for a really detailed response! I've gone through the tips you noted one by one: - 0.0.0.0 port. awesome suggestion, wish i tried that before, but unfortunately no dice. Still getting stuck. - `create_session` with `bash` named argument unfortunately also brought no results. - strace was epic! I got `Unsupported syscall process_vm_readv` during startup. Unfortunately it's hard for me to say if that's usual or normal. What do you think? i did try the `--min-instances` during the initial troubleshooting, but that didn't affect the result – alanmynah Sep 30 '21 at 15:33
  • Okay now I would want you to try running your application locally on Docker using these [instructions](https://cloud.google.com/run/docs/testing/local) and verify if your applications boots fine locally? – Priyashree Bhadra Sep 30 '21 at 15:40
  • Also the container must listen for requests on 0.0.0.0 on the port to which requests are sent. By default, requests are sent to 8080. Add the --min-instances in cloudbuild.yaml file and give it some value for now and then try. – Priyashree Bhadra Sep 30 '21 at 16:19
  • @alanmynah any update? – Priyashree Bhadra Oct 01 '21 at 08:27
  • Hi Priyashree, really sorry it took me awhile to get back to you. We've created a GKE cluster as a workaround for the problem and now I'm back. I have followed the instructions for local VS Code setup. That was epic, didn't know of this, https://cloud.google.com/code/docs/vscode/developing-a-cloud-run-service but unfortunately it worked. Or fortunately. Not sure. What are you thoughts? The container runs on 8080 and i can curl to both endpoints and i can see shutit output in `Cloud Run: Run/Debug` VSCode Output tab. – alanmynah Oct 05 '21 at 14:01
  • Great! So you are able to run it locally. That means there is no issue with the docker image or your configurations.`Unsupported syscall process_vm_readv` - This message indicates that system calls may not be supported in this [Cloud Run](https://cloud.google.com/run/docs/troubleshooting#sandbox)(fully managed) as container instances are sandboxed using the gVisor container runtime sandbox. Its a known issue in Cloud Run fully managed. You can try [Cloud Run for Anthos](https://cloud.google.com/anthos/run/docs/setup) with same configurations and container setup(with less allocated resources) – Priyashree Bhadra Oct 05 '21 at 16:19
  • We have chucked the workload into GKE now, so I guess, as a workaround that should be fine, but fully managed Cloud Run was still much much nicer. Oh well, thanks a lot for all your help. Still confused as to why it worked before and not any more. – alanmynah Oct 06 '21 at 08:44
  • Yes as per the steps we have followed in the debug process, I am confident that this is a Cloud Run sandbox issue. You can go ahead with GKE as well but to just let you know Cloud Run for Anthos is just a suitable alternative for Cloud Run fully managed specially for these cases. If you think my answer helped you, please consider accepting it. Thanks and have a great day! – Priyashree Bhadra Oct 06 '21 at 08:51
  • 1
    Absolutely, thanks a lot! – alanmynah Oct 06 '21 at 08:54
  • Is there a chance you might now what gVisor versions we might try to reproduce the possible regression with? – alanmynah Oct 06 '21 at 15:09

It's not a perfect replacement but you can use one of the following instead:

I'm not sure what the big picture is, so I'll offer various options.

For remote automation tasks from a Flask web server we're using Paramiko for its simplicity and quick setup, though you might prefer something like pyinfra for large projects or subprocess for small local tasks.

  1. Paramiko - a bit more hands-on/manual than shutit; runs commands over the SSH protocol.

example:

import paramiko

# connection details (you can also use SSH keys instead of a password)
ip = 'server ip'
port = 22
username = 'username'
password = 'password'

cmd = 'some useful command'

ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect(ip, port, username, password)

# run the command and collect its output
stdin, stdout, stderr = ssh.exec_command(cmd)
resp = ''.join(stdout.readlines())
print(resp)

ssh.close()

more examples

  2. pyinfra - an Ansible-like library to automate tasks in an ad-hoc style

Example: installing a package using apt:

from pyinfra.operations import apt

apt.packages(
    name='Ensure iftop is installed',
    packages=['iftop'],
    sudo=True,
    update=True,
)
  3. subprocess - like Paramiko, not as extensive as shutit, but works like a charm for small local tasks
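
A minimal subprocess sketch standing in for the question's `shell.send('echo Hello World')` call:

```python
import subprocess

# Run the command locally and capture its output;
# check=True raises CalledProcessError on a non-zero exit code.
result = subprocess.run(
    ['echo', 'Hello World'],
    capture_output=True, text=True, check=True,
)
print(result.stdout.strip())  # Hello World
```

Unlike shutit, subprocess does not keep an interactive shell session alive between commands, which is why it sidesteps the pexpect prompt handling that hangs under gVisor.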
Noam Yizraeli
  • Thanks for your reply! This was a bit of a simplified example with shutit, because the app is using it a bit more extensively and I just wanted to narrow down to the smallest possible repro example. But will probably give it a go to see if it'd be quick to rewrite using the suggestions you've provided. Thanks a lot – alanmynah Sep 23 '21 at 16:01
  • @alanmynah Glad it helped you, did it resolve your issue? – Noam Yizraeli Sep 26 '21 at 15:21
  • Afraid, not the underlying issue, but the workarounds are much appreciated too! – alanmynah Sep 27 '21 at 07:21
  • If you wouldn't find a resolving answer I'd appreciate accepting the answer – Noam Yizraeli Sep 27 '21 at 07:39