
Problem

We've had a working Ansible AWX instance running on v5.0.0 for over a year, and suddenly all jobs have stopped working -- no output is rendered. They start "running" but hang indefinitely without printing any logging.

The AWX instance is running in a docker compose container setup as defined here: https://github.com/ansible/awx/blob/5.0.0/INSTALL.md#docker-compose

Observations

Standard troubleshooting (restarting the containers, rebooting the host OS, etc.) hasn't helped, and there have been no configuration changes in either environment.

Upon debugging an actual playbook run, we observe that the command issued when a playbook is launched from the UI looks like the below:

ssh-agent sh -c ssh-add /tmp/awx_11021_0fmwm5uz/artifacts/11021/ssh_key_data && rm -f /tmp/awx_11021_0fmwm5uz/artifacts/11021/ssh_key_data && ansible-playbook -vvvvv -u ubuntu --become --ask-vault-pass -i /tmp/awx_11021_0fmwm5uz/tmppo7rcdqn -e @/tmp/awx_11021_0fmwm5uz/env/extravars playbook.yml

That's broken down into three commands in sequence:

  1. ssh-agent sh -c ssh-add /tmp/awx_11021_0fmwm5uz/artifacts/11021/ssh_key_data
  2. rm -f /tmp/awx_11021_0fmwm5uz/artifacts/11021/ssh_key_data
  3. ansible-playbook -vvvvv -u ubuntu --become --ask-vault-pass -i /tmp/awx_11021_0fmwm5uz/tmppo7rcdqn -e @/tmp/awx_11021_0fmwm5uz/env/extravars playbook.yml

You can see in command #3 that -vvvvv is the debugging argument -- however, the hang is happening on command #1, which has nothing to do with Ansible or AWX specifically, so the extra verbosity isn't going to get us much debugging info.

I tried running strace against it to see what is going on, but for the reasons given below it is pretty difficult to follow what it is actually hanging on. I can provide this output if it would help.

Analysis

So one natural question with command #1 -- what is 'ssh_key_data'?

Well, it's the Machine credential (an SSH key) we set up in AWX -- it hasn't changed in a while, and it works just fine when used in a direct SSH command. It's also apparently being set up by AWX as a named pipe (FIFO):

prw------- 1 root root 0 Dec 10 08:29 ssh_key_data

That starts to explain why it could be hanging: a read on a FIFO blocks until the other end is opened for writing, so if nothing is ever sent from the writer side of the pipe, ssh-add waits forever.
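
To illustrate the failure mode, here is a minimal Python sketch of the FIFO semantics in play (the path is made up):

import os
import tempfile

# Create a named pipe, analogous to AWX's ssh_key_data above.
path = os.path.join(tempfile.mkdtemp(), "ssh_key_data")
os.mkfifo(path, 0o600)  # shows up as prw------- in ls -l

# This open()/read() blocks until some other process opens the pipe
# for writing. If the writer never arrives, it hangs forever -- which
# is exactly the position `ssh-add <fifo>` is in for command #1.
with open(path) as f:
    key_data = f.read()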

Running a normal ansible-playbook from the command line (supplying the SSH key in a more conventional way) works just fine, so we can still deploy -- but only via the CLI right now. It's just AWX that is broken.
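
For reference, the working CLI invocation looks something like the below (the inventory and key paths are placeholders):

ansible-playbook -u ubuntu --become --ask-vault-pass -i inventory --private-key ~/.ssh/id_rsa playbook.yml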

Conclusions

So the question then becomes: why now? And how do we debug it? I have checked the health of awx_postgres and verified that the Machine credential is indeed present in the expected format (in the main_credential table). I have also verified that I can use ssh-agent on the awx_task container without the piped key file. So it really does seem to be this piped file that is the problem -- but I haven't been able to glean from any logs where the other side of the pipe (the writer) is supposed to be, or why it isn't sending the data.
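
For anyone repeating that database check, it looks something like this against the docker-compose install (the awx user/database names here are the install defaults and may differ in your inventory):

docker exec -it awx_postgres psql -U awx -d awx -c "SELECT id, name, modified FROM main_credential;"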

Jon
  • Some searching suggests that the other end of the pipe is written to by [this code](https://github.com/ansible/awx/blob/21895bd09b5c00621c5744e2d997b82e41b60493/awx/main/tasks.py#L1051), and that the purpose of using a pipe is to prevent the decrypted key from touching the disk. Similarly, ssh-agent is used to hold the key in memory. – Nick ODell Dec 12 '21 at 02:50
  • Indeed, following further, it looks like the pipe writing is done in ansible-runner, here: https://github.com/ansible/ansible-runner/blob/1.3.4/ansible_runner/utils.py#L337 – Jon Dec 12 '21 at 02:59
  • And even in the latest version of ansible-runner (v2.1.1), the code for `open_fifo_write()` is identical -- see the sketch after these comments. – Jon Dec 12 '21 at 03:05
  • Did your job ID number exceed 10000 recently? Perhaps you're having this bug: https://github.com/ansible/awx/issues/10489 – Nick ODell Dec 12 '21 at 03:27
  • To clarify, only the playbook jobs are currently failing. Inventory sync jobs are all still working. But it is crazy that the last successful playbook job ID was __10987__, (all playbooks started failing with job ID __10993__). That's awfully close to the magic number reported, and reading the comments, it appears there is nothing specific about the number 10000 that triggers the condition (with some offering conflicting numbers). In particular this comment describes the _exact_ symptoms in my question here: https://github.com/ansible/awx/issues/10489#issuecomment-930118691 – Jon Dec 12 '21 at 03:45
  • What's really bizarre: I tried debugging by adding a logging block here: https://github.com/ansible/ansible-runner/blob/83b5d4e688d2563b0fe89e0a184e06879ca73eec/ansible_runner/utils/__init__.py#L392 using the following code: `import logging; logger = logging.getLogger('awx.main.dispatcher'); logger.warning('Please work!')` The log message didn't appear -- but the job worked! But just that once; trying again, it hangs. This strongly suggests there is a race condition happening on this pipe. Perhaps the read happens before the writer is ready? – Jon Dec 12 '21 at 04:18
  • Got the logging to work by changing to `'awx.main.dispatch'` -- that was an oversight originally. The log is written to /var/log/tower/dispatcher.log on the awx_task container. Interestingly, there will be _two_ log statements written, i.e. `2021-12-12 04:31:08,122 WARNING awx.main.dispatch PID:205 Please work! 2021-12-12 04:31:11,356 WARNING awx.main.dispatch PID:205 Please work! ` – Jon Dec 12 '21 at 04:33
  • As an update to the most recent comment, the following was discovered in my additional debugging of the AWX dispatch methods -- turns out the SSH key data is written _twice_: once to `$AWX_PRIVATE_DATA_DIR/env/ssh_key`, and again to `$AWX_ISOLATED_DATA_DIR/ssh_key_data`, with `$AWX_PRIVATE_DATA_DIR` being something like `/tmp/awx_11172_ppt3uj4v` and `$AWX_ISOLATED_DATA_DIR` like `/tmp/awx_11172_ppt3uj4v/artifacts/11172`. The first write is done by AWX and the 2nd write is done by ansible-runner. Even though I don't think this is a good design, it's probably done for greater compatibility. – Jon Dec 13 '21 at 23:11
  • Anyway, the first write by AWX to `$AWX_PRIVATE_DATA_DIR/env/ssh_key` succeeds -- because it is picked up and read on the other side by perhaps the same parent process (ansible-runner running under the same Python as AWX?), Crowdstrike's script-based execution monitoring doesn't block it. But the 2nd write, to `$AWX_ISOLATED_DATA_DIR/ssh_key_data` (the one that needs to be read by `ssh-agent`), fails because it is blocked by Crowdstrike. Through my debugging I confirmed that it is definitely not the AWX code that is at fault, although I opened a new issue to suggest improvements. – Jon Dec 13 '21 at 23:16
  • That new issue is https://github.com/ansible/awx/issues/11453 – Jon Dec 13 '21 at 23:17
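
For context on the write side discussed in these comments, the ansible-runner code is essentially the following (paraphrased from v1.3.4's `open_fifo_write()`; exact details may differ):

import os
import stat
import threading

def open_fifo_write(path, data):
    # Create the named pipe (mode 0600 -- hence the prw------- entry above)
    # and write the key bytes from a background thread. That open() for
    # writing blocks the thread until a reader (ssh-add) opens the other
    # end, so if either side is blocked by a security product, the job hangs.
    os.mkfifo(path, stat.S_IRUSR | stat.S_IWUSR)
    threading.Thread(target=lambda p, d: open(p, 'wb').write(d),
                     args=(path, data)).start()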

5 Answers


Had the same issue starting this Friday, in the same timeframe as you. It turned out that the Crowdstrike (Falcon sensor) agent was the culprit. I'm guessing they pushed a definition update that is breaking or blocking FIFO pipes. When we stopped the CS agent, AWX started working correctly again, with no issues. Check whether you are running a similar security product.
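
(On systemd-based hosts, stopping the sensor is typically something like sudo systemctl stop falcon-sensor, though the unit name can vary by install.)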

  • We had the same issue, in almost the same time frame, and it is working after removing the Crowdstrike agent. – Shemeem Dec 13 '21 at 09:43
  • Was your AWX running in docker containers? – Jon Dec 13 '21 at 21:33
  • Confirmed, we were at CS v6.31 on this server, but the policy was updated on Wednesday evening, and that indeed propagated their script-based execution monitoring policy, which apparently blocks FIFO pipes, causing these AWX jobs to fail. – Jon Dec 13 '21 at 22:37
  • This installation was not running in docker containers. It was an RPM-based install. – Daniel Queen Dec 14 '21 at 15:13

For users of Crowdstrike, the problem is likely related to a policy change implemented by your organization over the weekend:

Crowdstrike released version 6.32, which many organizations adopted over the weekend in response to a Log4j vulnerability; it introduced some changes around script-level inspection.

Script-Based Execution Monitoring is the cause of the disruption. As other users have said, you can disable Crowdstrike entirely and restart the AWX jobs to get things working, but in production that may not be appropriate for security reasons.

Instead, contact your Crowdstrike administrator, who will have updated the policy of your instance profile to include Script-Based Execution Monitoring. The policy management GUI has a checkbox that enables/disables this feature (new in 6.32). Ask them to disable it and to send logs to the vendor.

user3888177
  • Interesting. Thanks for the feedback. On our server, AWX is running in a container -- Crowdstrike is installed on the host OS (not in the container). However, the version deployed is still 6.31.12803.0 -- is it possible that this policy can be deployed independent of the CS version number? – Jon Dec 13 '21 at 21:58
  • Confirmed it affects v6.31 as well -- the policy can include script-based execution monitoring regardless of whether the sensor is at 6.32. – Jon Dec 13 '21 at 22:37

Confirmed that a Crowdstrike policy update was the reason Ansible Tower stopped working for 48 hrs at our company as well. Disabling the monitoring option allowed jobs to run successfully almost instantly.


Crowdstrike has released sensor version 6.32.12905 to address this issue. Whoever configures Falcon in your environment can create or modify the Sensor Update Policy to push the latest version to the affected hosts.

Release Notes:

Summary: The hotfix release resolves an issue with Script-Based Execution Monitoring in which the Linux sensor might hang a command line that includes a pipe file.

Fixed: Resolves an issue with Script-Based Execution Monitoring in which the Linux sensor might hang a command line that includes a pipe file.
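
(To check which sensor version a Linux host is currently running, falconctl can report it, e.g. sudo /opt/CrowdStrike/falconctl -g --version.)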


I can confirm the same situation, even with AWX running in a container...

Regarding the existing solutions, none fit my case:

  1. Disabling the Crowdstrike service is impossible for security reasons.
  2. Asking to disable "Script-Based Execution Monitoring" would be a headache in my current position.

So I've patched ansible-runner to stop the shell (i.e. the "script-based" part) from reading the key file, and to let Python read it instead.

I'm using an old version of AWX (15.0.1), so the corresponding ansible-runner may be a bit old, but I've checked that the method still exists on master (in a different file).

I've moved from

def wrap_args_with_ssh_agent(self, args, ssh_key_path, ssh_auth_sock=None, silence_ssh_add=False):
    """
    Given an existing command line and parameterization this will return the same command line wrapped with the
    necessary calls to ``ssh-agent``
    """
    if ssh_key_path:
        ssh_add_command = args2cmdline('ssh-add', ssh_key_path)
    ...

to

def wrap_args_with_ssh_agent(self, args, ssh_key_path, ssh_auth_sock=None, silence_ssh_add=False):
    """
    Given an existing command line and parameterization this will return the same command line wrapped with the
    necessary calls to ``ssh-agent``
    """
    if ssh_key_path:
        # Let Python read the key (even from a FIFO), rather than having
        # the shell read the file -- that shell read is what script-based
        # execution monitoring blocks.
        ssh_key_content = ""
        with open(ssh_key_path) as f:
            ssh_key_content = f.read()

        # Feed the key to ssh-add on stdin via a here-string; note that
        # <<< is a bash feature, so this requires bash rather than plain sh.
        ssh_add_command = 'ssh-add - <<< "' + ssh_key_content + '"'

ATTENTION: the private key content will be exposed in the logs. This is simply a POC; I think we can do better (maybe writing the content to another file and using that copy? see the sketch below).
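
As a rough sketch of that "write to another file" idea (a hypothetical helper, not part of ansible-runner): let Python drain the pipe into a private temporary file and hand that regular path to ssh-add, so no shell ever touches the key content:

import os
import tempfile

from ansible_runner.utils import args2cmdline  # same helper the original code uses

def materialize_key(ssh_key_path):
    # Python, not the shell, reads the key (possibly from a FIFO), so
    # script-based execution monitoring never sees a script touch the pipe.
    with open(ssh_key_path) as f:
        key_data = f.read()
    # mkstemp creates the file with mode 0600, keeping the key private.
    fd, real_path = tempfile.mkstemp()
    with os.fdopen(fd, 'w') as out:
        out.write(key_data)
    return real_path  # the caller should delete this once ssh-add has read it

ssh_add_command = args2cmdline('ssh-add', materialize_key(ssh_key_path))

This keeps the key out of the command line and the logs, at the cost of briefly writing the decrypted key to disk -- the very thing the FIFO design was meant to avoid.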

Kakawait