How does Solaris SMF determine if something is to be in maintenance or to be restarted?

Question

I have a daemon process that I wrote, which is run by SMF. When an error occurs, my failure-handling code runs and the daemon then needs to restart from scratch. Right now it exits with sys.exit(0) (Python), but SMF keeps putting the service into maintenance mode.

I've worked with SMF enough to know that it sometimes auto-restarts certain services (and lets others fail and have you deal with them like this). How do I classify this process as one that needs to auto-restart? Is it an SMF setting, a method of failing, what?

I found [this](http://unixtips.hpage.co.in/smf_73792393.html) page, which explains transient vs contract, but changing this doesn't help anything. — , Apr 17 '13 at 21:58

score 3 · Answer 1 · edited Jun 20 '20 at 09:12

Manpage

Solaris uses a combination of startd/critical_failure_count and startd/critical_failure_period as described in the svc.startd manpage:

startd/critical_failure_count

startd/critical_failure_period

The critical_failure_count and critical_failure_period properties together specify the maximum number of service failures allowed in a given time interval before svc.startd transitions the service to maintenance. If the number of failures exceeds critical_failure_count in any period of critical_failure_period seconds, svc.startd will transition the service to maintenance.
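The rule quoted above can be modelled as a sliding-window count. The following is only a minimal sketch of that one sentence from the manpage, not svc.startd's actual implementation; the function name first_maintenance_time and the half-open treatment of the window boundary are assumptions made for illustration.

```python
from collections import deque


def first_maintenance_time(failure_times, count, period):
    """Return the timestamp (seconds) of the failure that pushes the number of
    failures inside one `period`-second window above `count`, or None.

    Simplified model of the manpage wording; the window boundary handling
    (half-open interval) is an assumption, not taken from svc.startd.
    """
    window = deque()
    for t in sorted(failure_times):
        window.append(t)
        # Forget failures that are `period` seconds or more older than this one.
        while t - window[0] >= period:
            window.popleft()
        if len(window) > count:
            return t
    return None


# Six failures within a few seconds exceed a count of 5 in a 600 s window.
print(first_maintenance_time([0, 1, 2, 3, 4, 5], count=5, period=600))

# Failures spaced 200 s apart never put more than 3 inside one 600 s window.
print(first_maintenance_time([0, 200, 400, 600, 800, 1000], count=5, period=600))
```

Under this model a service that crashes immediately on every start reaches maintenance quickly, while one that fails only occasionally keeps being restarted.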

Defaults in the source code

The defaults can be found in the source; the period depends on whether the service is "wait style":

if (instance_is_wait_style(inst))
    critical_failure_period = RINST_WT_SVC_FAILURE_RATE_NS;
else
    critical_failure_period = RINST_FAILURE_RATE_NS;

The defaults are either 5 failures/10 minutes or 5 failures/second:

#define RINST_START_TIMES   5       /* failures to consider */
#define RINST_FAILURE_RATE_NS   600000000000LL  /* 1 failure/10 minutes */
#define RINST_WT_SVC_FAILURE_RATE_NS    NANOSEC /* 1 failure/second */
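As a quick check of the units, the nanosecond constants above convert to the periods mentioned in the text. This is plain arithmetic on the values from the defines; it assumes NANOSEC is the number of nanoseconds in one second (1,000,000,000).

```python
# Values copied from the defines above; NANOSEC is assumed to be 10**9.
NANOSEC = 1_000_000_000
RINST_START_TIMES = 5
RINST_FAILURE_RATE_NS = 600_000_000_000
RINST_WT_SVC_FAILURE_RATE_NS = NANOSEC


def ns_to_seconds(ns):
    """Convert a whole number of nanoseconds to whole seconds."""
    return ns // NANOSEC


default_period_s = ns_to_seconds(RINST_FAILURE_RATE_NS)
wait_period_s = ns_to_seconds(RINST_WT_SVC_FAILURE_RATE_NS)

print(default_period_s, "seconds =", default_period_s // 60, "minutes")  # 600 seconds = 10 minutes
print(wait_period_s, "second for wait-style services")                   # 1 second
```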

These values can be overridden as properties in the service's SMF manifest:

<service_bundle type="manifest" name="npm2es">
  <service name="site/npm2es" type="service" version="1">
    ...
    <property_group name="startd" type="framework">
      <propval name='critical_failure_count' type='integer' value='10'/>
      <propval name='critical_failure_period' type='integer' value='30'/>
      <propval name="ignore_error" type="astring" value="core,signal" />
    </property_group>
    ...
  </service>
</service_bundle>

TL;DR

After checking against the startd values: if the service is "wait style", its restarts are throttled to at most one per second until it no longer exits with a non-configuration error. If the service is not "wait style", it is put into maintenance mode.

I attempted to visualize this in a Google Spreadsheet but must be missing something about the algo: https://docs.google.com/spreadsheets/d/1oIy1fbEALK7GK63SI9IyUKVyYfmllP_Kz8jhdyTdF6I/edit?usp=sharing . Someone else feel free to take a crack at it. — doublerebel, Dec 17 '15 at 22:51
Quick feedback from a maintainer of OpenIndiana. illumos and Solaris have started to diverge, please beware. — Toasterson, Aug 18 '22 at 21:04

score 1 · Answer 2 · answered Apr 18 '13 at 18:22

Presuming a normal service manifest, I would suspect that you're dropping into maintenance because SMF is restarting you "too quickly" (which is a bit arbitrarily defined). svcs -xv should tell you if that is the case. If it is, SMF is restarting you, you're exiting again rapidly, and it has decided to give up until the problem is fixed (and you've manually svcadm clear'd it).

I'd wondered if exiting 0 (and indicating success) may cause further confusion, but it doesn't appear that it will.

I don't think Oracle Solaris allows you to tune what SMF considers "too quickly".

I had the same theory as well, and some investigation using methods you mentioned helped resolve the issue. Because this post doesn't directly answer my question, I upvoted you and will accept my solution. Thanks for the help! — , Apr 18 '13 at 19:30

score 0 · Answer 3 · answered Apr 18 '13 at 03:06

You have to create a service manifest, which is fairly involved. The white paper linked below has example manifests and documents the manifest structure.

http://www.oracle.com/technetwork/server-storage/solaris/solaris-smf-manifest-wp-167902.pdf

jim mcnamara


I have already created a service manifest and imported it, hence the fact that I am going into maintenance mode through svcs. – Apr 18 '13 at 14:33

score 0 · Accepted Answer · answered Apr 18 '13 at 19:29

As it turns out, I had two pkills in a row in the script to make sure everything was terminated correctly. The second one, naturally, was exiting with a status other than 0, and that became the script's exit status. Adding an exit 0 at the end of the script solved the problem.
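The effect described here can be reproduced outside SMF: a shell script's exit status is that of its last command unless it ends with an explicit exit. The sketch below is only an illustration, since the original script is not shown. It uses `false` as a stand-in for a second pkill that finds nothing left to kill, and the helper name script_exit_status is hypothetical.

```python
import subprocess


def script_exit_status(script):
    """Run `script` with sh -c and return its exit status."""
    return subprocess.run(["sh", "-c", script]).returncode


# `false` stands in for a pkill that matched no processes (assumption).
failing_script = "true\nfalse"
fixed_script = "true\nfalse\nexit 0"

# The first script reports the failure of its last command.
print("without exit 0:", script_exit_status(failing_script))
# The second script always reports success.
print("with exit 0:", script_exit_status(fixed_script))
```

Forcing exit 0 hides every failure in the script, so it may be worth checking the status of the commands that really matter before the final exit.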
