Chef Client Not Resuming After Restart

Question

I ran the following recipe:

O:\chef\cookbooks\wincfg>chef-client -L C:\chef\rds_deployment.log -l info -z -o wincfg::rds_deployment

The server reboots as expected after installing a Windows feature.

I see the last lines of my log file say:

[2016-04-17T01:43:51+00:00] INFO: powershell_script[Desktop-Experience] ran successfully
[2016-04-17T01:43:51+00:00] INFO: powershell_script[Desktop-Experience] sending reboot_now action to reboot[reboot] (immediate)
[2016-04-17T01:43:51+00:00] INFO: Processing reboot[reboot] action reboot_now (wincfg::rds_deployment line 6)
[2016-04-17T01:43:51+00:00] WARN: Rebooting system immediately, requested by 'reboot'
[2016-04-17T01:43:51+00:00] INFO: Changing reboot status from {} to {:delay_mins=>0, :reason=>"There is a pending reboot.", :timestamp=>2016-04-17 01:43:51 +0000, :requested_by=>"reboot"}
[2016-04-17T01:43:51+00:00] WARN: Skipping final node save because override_runlist was given
[2016-04-17T01:43:51+00:00] INFO: Chef Run complete in 90.479509 seconds
[2016-04-17T01:43:51+00:00] INFO: Skipping removal of unused files from the cache
[2016-04-17T01:43:51+00:00] INFO: Running report handlers
[2016-04-17T01:43:51+00:00] INFO: Report handlers complete
[2016-04-17T01:43:51+00:00] WARN: Rebooting server at a recipe's request. Details: {:delay_mins=>0, :reason=>"There is a pending reboot.", :timestamp=>2016-04-17 01:43:51 +0000, :requested_by=>"reboot"}

The part of the recipe in question is:

reboot "reboot" do
  action :nothing
  reason 'There is a pending reboot.'
  only_if { reboot_pending? }
end

# Install each RDS-related feature, notifying the reboot resource
# immediately after any install that actually runs.
%w{ Desktop-Experience
  Remote-Desktop-Services
  RDS-RD-Server
  RDS-Connection-Broker
  RDS-Web-Access
  RDS-Licensing
  RDS-Gateway }.each do |feature|
  powershell_script feature do
    code <<-EOH
    Import-Module ServerManager
    Add-WindowsFeature #{feature}
    EOH
    # Idempotence guard: skip features that are already installed.
    not_if "Import-Module ServerManager; (Get-WindowsFeature -Name #{feature}).Installed -eq $true"
    notifies :reboot_now, 'reboot[reboot]', :immediately
  end
end

I would expect that, for each feature in the recipe, Chef installs it with Add-WindowsFeature if it is not already installed, then reboots immediately if reboot_pending? is true.

It seems that the reboot is happening, but then the recipe isn't picking up with the next feature (after Desktop-Experience).

UPDATE: Here is how I'm installing Chef (on a brand new out of the box EC2 image running Server 2012 R2 Base), the Chef Windows service, and the Chef DK:

powershell -NoProfile -ExecutionPolicy Bypass ". { iwr -useb https://omnitruck.chef.io/install.ps1 } | iex; install; cd C:\opscode\chef\bin\; cmd /c chef-service-manager -a install; cmd /c chef-service-manager -a start"

powershell -NoProfile -ExecutionPolicy Bypass ". { iwr -useb https://omnitruck.chef.io/install.ps1 } | iex; install -project chefdk"

Immediately after install, I run

net use O: \\fileserver\share
O:
cd chef\cookbooks\wincfg
berks vendor ..\..\cookbooks
chef-client -L C:\chef\rds_deployment.log -l info -z -o wincfg::rds_deployment

UPDATE 2:

I saw this warning in the logs:

[2016-04-17T01:43:51+00:00] WARN: Skipping final node save because override_runlist was given

So instead of specifying the run list with -o, I am now specifying it with -r. The warning no longer appears in the logs (and I see a TON more info in nodes\thehost.json), but the run still doesn't resume correctly after reboots :(

I do see the following in the Application Event Viewer following restart:

Failed Chef Client run UNKNOWN in UNKNOWN seconds.
 Exception type: Chef::Exceptions::PrivateKeyMissing
 Exception message: I cannot read C:\chef\validation.pem, which you told me to use to sign requests!
 Exception backtrace: C:/opscode/chef/embedded/lib/ruby/gems/2.1.0/gems/chef-12.9.38-universal-mingw32/lib/chef/http/authenticator.rb:86:in `rescue in load_signing_key'
C:/opscode/chef/embedded/lib/ruby/gems/2.1.0/gems/chef-12.9.38-universal-mingw32/lib/chef/http/authenticator.rb:76:in `load_signing_key'

I love a good adventure through (lack of) documentation.

I ALMOST got it working by:

- making sure the chef_repo path is available at all times (not a network drive)
- creating a client.rb file in C:\chef\ that tells chef-client to always run in local (chef-zero) mode, not just when I invoke it manually from the command line

So, my new artifacts look like this:

C:\chef\client.rb:

log_level :info
log_location 'C:\chef\client.log'
chef_server_url 'https://localhost:4000'
validation_client_name 'chef-validator'
chef_zero.enabled true
chef_zero.port 4000
local_mode true
cookbook_path ['C:\chef_repo\cookbooks']

\\ops01\ops\chef\bootstrap.bat:

mklink C:\chef_repo %~dp0 /d
powershell -NoProfile -ExecutionPolicy Bypass ". { iwr -useb https://omnitruck.chef.io/install.ps1 } | iex; install"
C:
cd \opscode\chef\bin\
copy %~dp0client.rb C:\chef\ /y
call chef-service-manager -a install
call chef-service-manager -a start

The key parts are bootstrapping client.rb and making sure the link is available at all times, since client.rb doesn't support UNC/SMB paths.

The chef-client Windows service now seems to pick up runs automatically after reboots, BUT when it does, it doesn't trigger the reboot itself. Instead it logs:

[2016-04-18T02:38:24+00:00] INFO: Changing reboot status from {} to {:delay_mins=>0, :reason=>"There is a pending reboot for \#{pack}.", :timestamp=>2016-04-18 02:38:24 +0000, :requested_by=>"googlechrome_reboot"}
[2016-04-18T02:38:24+00:00] INFO: HTTP Request Returned 500 Internal Server Error: error
[2016-04-18T02:38:24+00:00] ERROR: Running exception handlers
[2016-04-18T02:38:24+00:00] ERROR: Exception handlers complete
[2016-04-18T02:38:24+00:00] FATAL: Stacktrace dumped to c:/chef/local-mode-cache/cache/chef-stacktrace.out
[2016-04-18T02:38:24+00:00] FATAL: Please provide the contents of the stacktrace.out file if you file a bug report
[2016-04-18T02:38:24+00:00] FATAL: Net::HTTPFatalError: 500 "Internal Server Error"
[2016-04-18T02:38:37+00:00] INFO: Child process exited (pid: 692)
[2016-04-18T02:38:38+00:00] INFO: Next chef-client run will happen in 1800.8035677517687 seconds

So it looks like the chef-zero server is returning an HTTP 500 error. The Event Viewer application log shows:

Failed Chef Client run af972109-32ca-4089-97ef-789b7b5d8d07 in 133.762612 seconds.
 Exception type: Net::HTTPFatalError
 Exception message: 500 "Internal Server Error"
 Exception backtrace: C:/opscode/chef/embedded/lib/ruby/2.1.0/net/http/response.rb:119:in `error!'
C:/opscode/chef/embedded/lib/ruby/gems/2.1.0/gems/chef-12.9.38-universal-mingw32/lib/chef/http.rb:146:in `request'
C:/opscode/chef/embedded/lib/ruby/gems/2.1.0/gems/chef-12.9.38-universal-mingw32/lib/chef/http.rb:119:in `put'
C:/opscode/chef/embedded/lib/ruby/gems/2.1.0/gems/chef-12.9.38-universal-mingw32/lib/chef/node.rb:620:in `save'
C:/opscode/chef/embedded/lib/ruby/gems/2.1.0/gems/chef-12.9.38-universal-mingw32/lib/chef/client.rb:542:in `save_updated_node'
C:/opscode/chef/embedded/lib/ruby/gems/2.1.0/gems/chef-12.9.38-universal-mingw32/lib/chef/client.rb:704:in `converge_and_save'
C:/opscode/chef/embedded/lib/ruby/gems/2.1.0/gems/chef-12.9.38-universal-mingw32/lib/chef/client.rb:281:in `run'
C:/opscode/chef/embedded/lib/ruby/gems/2.1.0/gems/chef-12.9.38-universal-mingw32/lib/chef/application.rb:267:in `run_with_graceful_exit_option'
C:/opscode/chef/embedded/lib/ruby/gems/2.1.0/gems/chef-12.9.38-universal-mingw32/lib/chef/application.rb:243:in `block in run_chef_client'
C:/opscode/chef/embedded/lib/ruby/gems/2.1.0/gems/chef-12.9.38-universal-mingw32/lib/chef/local_mode.rb:44:in `with_server_connectivity'
C:/opscode/chef/embedded/lib/ruby/gems/2.1.0/gems/chef-12.9.38-universal-mingw32/lib/chef/application.rb:226:in `run_chef_client'
C:/opscode/chef/embedded/lib/ruby/gems/2.1.0/gems/chef-12.9.38-universal-mingw32/lib/chef/application/client.rb:419:in `run_application'
C:/opscode/chef/embedded/lib/ruby/gems/2.1.0/gems/chef-12.9.38-universal-mingw32/lib/chef/application.rb:58:in `run'
C:/opscode/chef/embedded/lib/ruby/gems/2.1.0/gems/chef-12.9.38-universal-mingw32/bin/chef-client:26:in `<top (required)>'
C:/opscode/chef/bin/chef-client:61:in `load'
C:/opscode/chef/bin/chef-client:61:in `<main>'

which doesn't really indicate anything to me...

But if I go to the command line and just run chef-client (from any directory, with no parameters), it immediately recognizes the need to reboot and does so.

Any ideas to finish out this problem? Would REALLY appreciate it.

score 1 · Accepted Answer · answered Apr 17 '16 at 05:52

Unless you set something up where Chef runs as a service or via a scheduled task, it can't just end up running again on its own after a restart. Also Chef doesn't per se "pick up where it left off", but it is normally idempotent and only changes things that need to be changed. The not_if guard on your resource is the idempotence check for each thing. Is there a reason you aren't using the windows_feature resource?
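
As an illustration of the idempotence point, here is a minimal plain-Ruby sketch. It is not Chef code: `Step`, `converge`, and the in-memory `installed` set are hypothetical names made up for this example, and the "reboot" is only simulated by returning early. It shows why nothing has to remember where the previous run stopped: every run starts from the top, and the guard turns already-applied steps into no-ops.

```ruby
require 'set'

# Hypothetical stand-in for a resource with a not_if-style guard.
Step = Struct.new(:name, :already_done, :apply)

# Walks the steps from the top. Applies the first step whose guard says it
# is not done yet, then returns its name (simulating an immediate reboot).
# Returns nil when every step is already done.
def converge(steps)
  steps.each do |step|
    next if step.already_done.call # guard: skip work that is already done
    step.apply.call
    return step.name               # "reboot"; the next run starts over
  end
  nil
end

installed = Set.new # stands in for machine state that survives reboots
features = %w[Desktop-Experience RDS-RD-Server RDS-Licensing]
steps = features.map do |f|
  Step.new(f, -> { installed.include?(f) }, -> { installed << f })
end

# Each loop iteration is one fresh run after a simulated reboot.
runs = []
while (applied = converge(steps))
  runs << applied
end
puts runs.inspect
```

In the real recipe the surviving state is the set of installed Windows features, and the `not_if` guard plays the role of `already_done`.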

coderanger


I have chef running as a Windows service. I'm not using the windows_feature resource because it seems unnecessary and more verbose....and does the same thing as my loop, no? – Jeff Apr 17 '16 at 14:07
If chef is running as a service, why isn't it trying to re-run the recipe on next boot? – Jeff Apr 17 '16 at 14:08
Check your logs? Both the event log and whatever Chef logs you have configured. How did you install the service? – coderanger Apr 17 '16 at 20:57
As I posted, the Chef log file I specified in the original chef-client -z -o command line stops at the reboot and doesn't continue. I just updated my question to include my installation method. The Application log in Event Viewer doesn't indicate anything. – Jeff Apr 17 '16 at 21:57
What tells chef (like what file) what run was in progress that it needs to "restart" when the computer restarts? – Jeff Apr 17 '16 at 22:13
Nothing, if you set it up as a service it should start when Windows starts and it runs a new converge from the top. Thanks to idempotence, everything before that _should_ be a no-op, but that depends on how you write your recipe code. – coderanger Apr 18 '16 at 17:50
From the logs, it looks like the reboot request is causing some http 500 error in zero client mode when running as a windows service and then the run aborts and it doesn't reboot. If I leave the node sitting for 30 minutes...then the node reboots by itself. Any ideas? – Jeff Apr 18 '16 at 22:51
My windows-fu is weak but I would assume the security context you are running it as doesn't have write access to the working directory being used to store node data files. – coderanger Apr 18 '16 at 22:55
It must though...because when it reboots, it does resume the configuration for the node all by itself when the service starts. – Jeff Apr 18 '16 at 23:46
I wonder if the reboot is racing with the zero client web server and is causing the zero client server to bomb...? – Jeff Apr 18 '16 at 23:47
The server is started internally. That said, using local-mode in a service is weird and probably untested. – coderanger Apr 19 '16 at 02:11
So what would your recommendation be if I want to run chef in local mode (no mgmt server) as a way of bootstrapping and configuring my new machines - basically, they install chef on creation, set the client.rb as I have above, and then should proceed with the runlist, doing as many reboots as necessary in the process? If not with the windows service picking up where the reboot left off...then what? – Jeff Apr 19 '16 at 02:58
I wouldn't recommend doing that. Use something external to orchestrate reboots or just deal with running a Chef Server. – coderanger Apr 19 '16 at 03:13
I'm developing and testing here. A chef server is something down the pipeline, but it shouldn't be required to test recipes. I can tell that the chef zero server is getting halted due to a 500 error...is there a place I can find the log for the chef zero server? – Jeff Apr 19 '16 at 03:16
I'm trying to decide whether to use this or DSC. DSC is one of those half-baked MS technologies, but it supports resuming without a problem, and this is a must for developing. – Jeff Apr 19 '16 at 03:17
Also - bear in mind that if I want to use chef for opscode, this will be a deal breaker if recipes require a reboot – Jeff Apr 19 '16 at 03:18
I don't know what `opscode` means in this context, but the better approach for restart handling would either be to do it manually or use a one-off scheduled task instead of a service, and make sure the task runs under your sid. This is well beyond SO comments though, handling reboots in any scripting system is going to be tricky. – coderanger Apr 19 '16 at 04:01
Sorry - meant opsworks...I'll try the scheduled task – Jeff Apr 19 '16 at 09:42
You were right - it was a security problem writing to the cookbooks/nodes directory when running as a service. Nice call! – Jeff Apr 19 '16 at 11:31
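
Since the root cause turned out to be write access for the service account, a quick probe can confirm it before digging through stack traces. The following is a minimal plain-Ruby sketch under stated assumptions: `can_write_to?` is a hypothetical helper, not part of Chef, and the path in the usage comment is a guess at where the nodes directory lives in the setup described above. It tries to create and delete a small file, which is a more direct check than reading permission bits when Windows ACLs are involved.

```ruby
require 'securerandom'

# Hypothetical helper: returns true if the current account can create and
# then delete a file inside dir, false if either operation fails.
def can_write_to?(dir)
  probe = File.join(dir, ".write-probe-#{SecureRandom.hex(4)}")
  File.write(probe, 'ok')
  File.delete(probe)
  true
rescue SystemCallError, IOError
  # Covers access denied, missing directory, and similar OS-level errors.
  false
end

# Usage (the path is an assumption, adjust it to your repo layout):
#   puts can_write_to?('C:/chef_repo/nodes')
```

The result is only meaningful when the script runs under the same account as the chef-client service, for example from a scheduled task configured with that account.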
