31

I'm trying to reboot a server running CentOS 7 on VirtualBox. I use this task:

- name: Restart server
  command: /sbin/reboot
  async: 0
  poll: 0
  ignore_errors: true

The server is rebooted, but I get this error:

TASK: [common | Restart server] ***********************************************
fatal: [rolcabox] => SSH Error: Shared connection to 127.0.0.1 closed.
It is sometimes useful to re-run the command using -vvvv, which prints SSH debug output to help diagnose the issue.

FATAL: all hosts have already failed -- aborting

What am I doing wrong? How can I fix this?

Domen Blenkuš
  • Based on the answer provided by Marcin Skarbek I prepared and published to Ansible Galaxy a role that uses that method. You can find the Reboot-And-Wait role [here](https://galaxy.ansible.com/it-praktyk/Reboot-And-Wait/). Thanks for using it; feedback is welcome. – Wojciech Sciesinski Mar 17 '17 at 23:22
  • Due to Ansible's quick pace of development, the older answers are not working for me anymore. Please have a look at my answer. – Telegrapher Nov 22 '17 at 00:50

11 Answers

43

You're likely not doing anything truly wrong, it's just that /sbin/reboot is shutting down the server so quickly that the server is tearing down the SSH connection used by Ansible before Ansible itself can close it. As a result Ansible is reporting an error because it sees the SSH connection failing for an unexpected reason.

What you might want to do to get around this is to switch from using /sbin/reboot to using /sbin/shutdown instead. The shutdown command lets you pass a time, and when combined with the -r switch it will perform a reboot rather than actually shutting down. So you might want to try a task like this:

- name: Restart server
  command: /sbin/shutdown -r +1
  async: 0
  poll: 0
  ignore_errors: true

This will delay the server reboot for 1 minute, but in doing so it should give Ansible enough time to close the SSH connection itself, thereby avoiding the error that you're currently getting.
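
As a side note not found in the original answer: because the reboot is now scheduled a minute out, a play that fails in a later task leaves that shutdown pending. On systemd-based distros such as CentOS 7, shutdown -c cancels a scheduled shutdown, so a cleanup task could look roughly like this sketch:

- name: Cancel a pending scheduled reboot (illustrative sketch)
  command: /sbin/shutdown -c
  ignore_errors: true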

approxiblue
Bruce P
  • 4
    Thanks, this works great! There was just one small catch: I got a ```Failed to parse time specification: +1m``` error, so I had to replace ```+1m``` with ```+1```. – Domen Blenkuš Apr 30 '15 at 08:23
  • Note: using reboot is not a good idea unless you know what you are doing. It may be OK in a lot of Linux distros, but on other Unixes it can bypass a lot of the system shutdown scripts and amounts to a very hard stop. This can lead to inconsistent DBs etc., so it's bad practice. Better to use shutdown or init. – krad Feb 06 '18 at 16:02
  • What is the purpose of async and poll here? – Basil Musa Jun 19 '18 at 16:32
  • 1
    @Basil, [see the docs here](https://docs.ansible.com/ansible/latest/user_guide/playbooks_async.html). ;) – Paul Hodges Jun 28 '18 at 14:57
12

After the reboot task, you should have a local_action task that waits for the remote host to finish rebooting; otherwise the SSH connection will be terminated, and so will the playbook.


- name: Reboot server
  command: /sbin/reboot

- name: Wait for the server to finish rebooting
  sudo: no
  local_action: wait_for host="{{ inventory_hostname }}" search_regex=OpenSSH port=22 timeout=300

I also wrote a blog post about achieving a similar solution: https://oguya.github.io/linux/2015/02/22/ansible-reboot-servers/

James Oguya
10

- name: restart server
  shell: sleep 2 && shutdown -r now "Ansible updates triggered"
  async: 1
  poll: 0
  become: true
  ignore_errors: true


- name: waiting for the server to come back
  local_action: wait_for host=testcentos state=started delay=30 timeout=300
  sudo: false
Saad
7

Another solution:

- name: reboot host
  command: /usr/bin/systemd-run --on-active=10 /usr/bin/systemctl reboot
  async: 0
  poll: 0

- name: wait for host sshd
  local_action: wait_for host="{{ inventory_hostname }}" search_regex=OpenSSH port=22 timeout=300 delay=30

systemd-run creates a new service "on the fly" that will start systemctl reboot after a 10-second delay (--on-active=10). The delay=30 in wait_for adds an extra 20 seconds to make sure the host has actually started rebooting.
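
As an aside (not from the original answer), systemd-run registers a transient timer unit, so you can verify what it scheduled before the reboot fires. A minimal sketch, assuming a systemd host:

- name: List pending transient timers (illustrative check)
  command: /usr/bin/systemctl list-timers --all
  register: timer_list
  changed_when: false

- name: Show pending timers
  debug:
    var: timer_list.stdout_lines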

Marcin Skarbek
6

None of the above solutions worked reliably for me.

Issuing a /sbin/reboot crashes the play (the SSH connection is closed before Ansible finishes the task; it crashes even with ignore_errors: true), and /usr/bin/systemd-run --on-active=2 /usr/bin/systemctl reboot will not reboot after 2 seconds but after a random amount of time, anywhere between 20 seconds and one minute, so the delay is sometimes not sufficient and the timing is not predictable.

Also, I don't want to wait for minutes when a cloud server can reboot in a few seconds.

So here is my solution:

- name: Reboot the server for kernel update
  shell: ( sleep 3 && /sbin/reboot & )
  async: 0
  poll: 0 

- name: Wait for the server to reboot
  local_action: wait_for host="{{ansible_host}}" delay=15 state=started port="{{ansible_port}}" connect_timeout=10 timeout=180

That's the shell: ( sleep 3 && /sbin/reboot & ) line that does the trick.

Using ( command & ) in a shell script runs a program in the background and detaches it: the command succeeds immediately, but the program persists after the shell is destroyed.

Ansible gets its response immediately, and the server reboots 3 seconds later.
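
To make that detach behavior concrete, here is a hypothetical demo task (the marker path is made up, not part of the answer): the shell module returns as soon as the subshell exits, while the backgrounded command keeps running.

- name: Demonstrate the detached subshell (illustrative only)
  shell: ( sleep 3 && touch /tmp/detach-demo & )
  # This task finishes immediately; /tmp/detach-demo appears about 3 seconds
  # later because the backgrounded command outlives the shell that spawned it.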

cronvel
  • How does putting the /sbin/reboot command in the background help? – mbigras Oct 05 '18 at 23:33
  • @mbigras Because it doesn't block Ansible's execution, so the second action is executed immediately. Without that, it would not work: Ansible would wait for the command to return, but would lose the connection during the system's shutdown. I believe the latest Ansible releases have a better solution, but I didn't investigate further. – cronvel Oct 07 '18 at 09:31
  • @mbigras Anyway, whatever the reason, other answers simply don't work for me reliably. – cronvel Oct 07 '18 at 09:34
5

Ansible is developing quickly and the older answers were not working for me.

I found two issues:

  • The recommended way of rebooting may kill the SSH connection before Ansible finishes the task.

It is better to run: nohup bash -c "sleep 2s && shutdown -r now" &

This will launch a shell with the sleep && shutdown, but will not wait for the shell to end due to the last &. The sleep will give some time for the Ansible task to end before the reboot and the nohup will guarantee that bash doesn't get killed when the task ends.

  • The wait_for module is not reliably waiting for the SSH service.

It detects the port as open, probably opened by systemd, but when the next task runs, SSH is still not ready.

If you're using Ansible 2.3+, wait_for_connection works reliably.

The best 'reboot and wait' in my experience (I am using Ansible 2.4) is the following:

- name: Reboot the machine
  shell: nohup bash -c "sleep 2s && shutdown -r now" &

- name: Wait for machine to come back
  wait_for_connection:
    timeout: 240
    delay: 20

I've got the nohup command from: https://github.com/keithchambers/microservices-playground/blob/master/playbooks/upgrade-packages.yml

I edited this message to:

  • add krad's portability suggestion, using shutdown -r now instead of reboot
  • add a delay; it is needed to prevent Ansible from executing the next step if the reboot is slow
  • increase the timeout; 120s was too little for some slow BIOSes.
Telegrapher
  • As per my previous comments, it's really bad to use reboot – krad Feb 06 '18 at 16:18
  • Sounds sensible in a non-Linux world. I've never found a linux where reboot wouldn't perform a proper organized reboot. – Telegrapher Feb 06 '18 at 16:39
  • Another advantage is that the shutdown command can include a timeout, without the need to use sleep, which is cleaner. I like your suggestion of portability, but I'll test it before changing it here, just in case. The answer I posted is the simplest of all and works reliably in the meantime. – Telegrapher Feb 06 '18 at 16:45
  • I've had a look, and I don't like one thing about shutdown: the minimum granularity of the delay is 1 minute, which is wasteful, so we can't stop using the sleep. – Telegrapher Apr 03 '18 at 12:53
  • Just use 'shutdown now' with a sleep in front. The world is bigger than linux. – krad Apr 04 '18 at 12:06
  • I'm using the comments as thought process documentation. I would've expected that you would've had a look at the updated answer before your non-constructive comment. – Telegrapher Apr 05 '18 at 12:19
3

Yet another (combined from other answers) version:

---
- name: restart server
  command: /usr/bin/systemd-run --on-active=5 --timer-property=AccuracySec=100ms /usr/bin/systemctl reboot
  async: 0
  poll: 0
  ignore_errors: true
  become: yes

- name: wait for server {{ ansible_ssh_host | default(inventory_hostname) }} to come back online
  wait_for:
    port: 22
    state: started
    host: '{{ ansible_ssh_host | default(inventory_hostname) }}'
    delay: 30
  delegate_to: localhost
Andrzej Rehmann
3

The following solution works perfectly for me:

- name: Restart machine
  shell: "sleep 5 && sudo shutdown -r now"
  async: 1
  poll: 0

- name: wait for SSH to be available again
  wait_for_connection:
    connect_timeout: 20
    sleep: 5
    delay: 5
    timeout: 300

The sleep is required because Ansible needs a few seconds to wrap up the connection. An excellent post about this problem was written here: https://www.jeffgeerling.com/blog/2018/reboot-and-wait-reboot-complete-ansible-playbook

3

If you're using Ansible version >= 2.7, you can use the reboot module as described here.

The synopsis of the reboot module itself:

Reboot a machine, wait for it to go down, come back up, and respond to commands.

In its simplest form, you can define a task like this:

    - name: reboot server
      reboot:

But you can add some params, like test_command, to test whether your server is ready to take further tasks:

    - name: reboot server
      reboot:
        test_command: whoami
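
The reboot module also accepts timing parameters; for example, reboot_timeout bounds how long Ansible waits for the machine to come back (the value below is illustrative, not a recommendation):

    - name: reboot server
      reboot:
        test_command: whoami
        reboot_timeout: 300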

Hope this helps!

kxu
2

I am using Ansible 2.5.3. The code below works with ease:

- name: Rebooting host
  shell: 'shutdown -r +1 "Reboot triggered by Ansible"'

- wait_for_connection:
    delay: 90
    timeout: 300

You can reboot immediately, then insert a delay if your machine takes a while to go down:

    - name: Rebooting host
      shell: 'shutdown -r now "Reboot triggered by Ansible"'
      async: 1
      poll: 1
      ignore_errors: true

    # Wait 120 seconds to make sure the machine won't connect immediately in the next section.
    - name: Delay for the host to go down
      local_action: shell /bin/sleep 120

Then poll to make the playbook return as soon as possible:

    - name: Wait for the server to finish rebooting
      wait_for_connection:
        delay: 15
        sleep: 15
        timeout: 300

This will make the playbook return as soon as possible after the reboot.

Mike S
Ashwin
  • Hello, I find that this solution may work, but it is sub-optimal. I specified the reasons in my answer's comments: shutdown -r +1 is safe and will always work, but +1 adds one minute of delay to the reboot, which may be undesirable; shutdown -r now is not safe, since SSH may be killed before Ansible gets its answer back, returning a failure, hence the +1 or the nohup + sleep combination is needed. I would like to find a solution as simple as shutdown -r +1, but without the 1 min delay. – Telegrapher Jul 09 '18 at 14:39
  • Seems like your nohup answer is just another way to skin the cat. You've given the shell the job of exiting cleanly, so you don't have to tell Ansible to ignore the error. You say potaytoh, I say potahtoe. – Mike S Jul 11 '18 at 22:50
  • Well, it is usually better to avoid triggering an error than ignoring errors in general. By ignoring all possible shutdown errors, you may be ignoring another error that should be noticed. – Telegrapher Sep 13 '18 at 18:12
1

At reboot time all SSH connections are closed. That's why the Ansible task fails. The ignore_errors: true or failed_when: false additions no longer work as of Ansible 1.9.x, because the handling of SSH connections has changed and a closed connection is now a fatal error which cannot be caught during the play.

The only way I figured out to do this is to run a local shell task which starts a separate SSH connection; that connection may then fail without failing the play.

- name: Rebooting
  delegate_to: localhost
  shell: ssh -S "none" {{ inventory_hostname }} "sudo /usr/sbin/reboot"
  failed_when: false
  changed_when: true
udondan
  • Thanks for the explanation, but your approach probably requires passwordless sudo (or did I miss something?), so I can't use it in production. – Domen Blenkuš Apr 30 '15 at 10:19