
My Java application is installed on OpenSUSE 13.2, and I'm using systemd (version 210) for process control.

I would like to take advantage of the systemd watchdog functionality using systemd-notify. However, I notice the app restarting due to inconsistent timeouts from the watchdog.

With WatchdogSec=120, and the app configured to call systemd-notify every 60 seconds, I observe restarts every five to 20 minutes, on average.

Here is the (slightly redacted) systemd unit file for the process:

# Cool systemd service
[Unit]
Description=Something Awesome
After=awesomeparent.service
Requires=awesomeparent.service

[Service]
Type=simple
WorkingDirectory=/opt/awesome
Environment="AWESOME_HOME=/opt/awesome" 
User=awesomeuser
Restart=always
WatchdogSec=120
NotifyAccess=all
ExecStart=/home/awesome/jre1.8.0_05/bin/java -jar awesome.jar

[Install]
WantedBy=multi-user.target

And here is the code that calls systemd-notify:

String pidStr = ManagementFactory.getRuntimeMXBean().getName();
pidStr = pidStr.split("@")[0];

String cmd = "/usr/bin/systemd-notify";

Process process = new ProcessBuilder(cmd, 
                                    "MAINPID=" + pidStr, 
                                    "WATCHDOG=1").redirectErrorStream(true)
                                                 .start();

int exitCode = 0;
if ((exitCode = process.waitFor()) != 0) {                
    String output = IOUtils.toString(process.getInputStream());
    Log.MAIN_LOG.error("Failed to notify systemd: " + 
                              ((output.isEmpty()) ? "" : " " + output) +
                              " Exit code: " + exitCode);

}

In the logs, I never see the failure messages (process always returns 0 exit code) and I'm 100% sure that the task IS being executed once per minute, on the minute. I can see the task log being executed immediately before restarts.

Anyone have any ideas why systemd-notify just doesn't work sometimes?

I'm thinking about writing code to call sd_pid_notify directly, but would like to know if there's a simple config thing I can do before going that route.

Kyle Fransham
  • Have you tried using a JNI call to [sd_notify(3)](http://www.freedesktop.org/software/systemd/man/sd_notify.html)? That way you could check the status of the call more accurately. I suspect there is some problem with the call between the Java daemon and systemd. I'd also log a message immediately before `ProcessBuilder.start()` and put a logging shell wrapper around `systemd-notify`, just to make sure the subprocess is invoked on time and there are no unpredictable delays – user3159253 Nov 27 '15 at 23:37
  • I have a similar issue with CentOS7.0 (systemd 208). I have the same 2 minute watchdog time, and it failed (seemingly randomly) today. In my case, I call `sd_notify()` directly once a second. I don't have any indication that the process sending notifications was halted at all. – Mark Lakata Apr 27 '16 at 22:28
  • I ended up using JNA for this, and it's been rock-solid ever since. I'll post the code in an answer below. – Kyle Fransham Apr 29 '16 at 18:27

2 Answers


Here's the JNA code that solved the problem:

import com.sun.jna.Library;
import com.sun.jna.Native;

/**
 * The task issues a notification to the systemd watchdog. The systemd watchdog
 * will restart the service if the notification is not received.
 */

public class WatchdogNotifierTask implements Runnable {

  private static final String SYSTEMD_SO = "systemd";
  private static final String WATCHDOG_READY = "WATCHDOG=1";

  @Override
  public void run() {
    try {
      int returnCode = SystemD.INSTANCE.sd_notify(0, WATCHDOG_READY);
      if (returnCode < 0) {
        Log.MAIN_LOG.error(
            "Systemd watchdog returned a negative error code: " + Integer.toString(returnCode));
      } else {
        Log.MAIN_LOG.debug("Successfully updated systemd watchdog.");
      }
    } catch (Exception e) {
      Log.MAIN_LOG.error("calling sd_notify native code failed with exception: ", e);
    }
  }

  /**
   * This is a linux-specific interface to load the systemd shared library and call the sd_notify
   * function. Should we need other systemd functionality, it can be loaded here. It uses JNA for
   * native library calls.
   */
  interface SystemD extends Library {
    SystemD INSTANCE = (SystemD) Native.loadLibrary(SYSTEMD_SO, SystemD.class);
    int sd_notify(int unset_environment, String state);
  }

}
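
The task above still has to be run periodically. As a minimal sketch (not part of the original answer), one way to do that is a `ScheduledExecutorService` whose period is derived from the `WATCHDOG_USEC` environment variable, which systemd sets for services that have `WatchdogSec=` configured (see sd_watchdog_enabled(3)). The class name `WatchdogSchedule` and its methods are hypothetical helpers introduced here for illustration; pinging at half the timeout follows the common recommendation and matches the 60-second interval used with `WatchdogSec=120` in the question.

```java
import java.util.OptionalLong;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: derives a ping period from WATCHDOG_USEC and schedules a task. */
final class WatchdogSchedule {

    private WatchdogSchedule() {}

    /** Returns half of the watchdog timeout in microseconds, or empty if unset or invalid. */
    static OptionalLong pingPeriodMicros(String watchdogUsec) {
        if (watchdogUsec == null || watchdogUsec.trim().isEmpty()) {
            return OptionalLong.empty();
        }
        try {
            long timeout = Long.parseLong(watchdogUsec.trim());
            // A timeout below 2 microseconds cannot be halved into a positive period.
            return timeout > 1 ? OptionalLong.of(timeout / 2) : OptionalLong.empty();
        } catch (NumberFormatException e) {
            return OptionalLong.empty();
        }
    }

    /** Schedules the notifier at a fixed rate; returns null when no watchdog is configured. */
    static ScheduledExecutorService start(Runnable notifier, String watchdogUsec) {
        OptionalLong period = pingPeriodMicros(watchdogUsec);
        if (!period.isPresent()) {
            return null;
        }
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "systemd-watchdog");
            t.setDaemon(true); // do not keep the JVM alive just for watchdog pings
            return t;
        });
        executor.scheduleAtFixedRate(notifier, 0, period.getAsLong(), TimeUnit.MICROSECONDS);
        return executor;
    }
}
```

Usage would be something like `WatchdogSchedule.start(new WatchdogNotifierTask(), System.getenv("WATCHDOG_USEC"));`. Note that `scheduleAtFixedRate` stops rescheduling a task that throws, so the notifier should keep catching its own exceptions, as `WatchdogNotifierTask.run()` does.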
Kyle Fransham
  • 4
    This was super useful, I extended it a little bit: https://gist.github.com/juur/048cc3d0554953b717e9c6867970f30e – Ian Nov 06 '18 at 18:17
  • So as of Java 16, Unix domain sockets are supported natively by java through SocketChannel and ServerSocketChannel API. Any chance someone would be willing to post an answer that doesn't rely on JNA and uses those instead? – Mohamed Hafez Feb 12 '23 at 17:59

> Anyone have any ideas why systemd-notify just doesn't work sometimes?

This is actually a long-standing problem in several systemd protocols, not just in the readiness notification protocol spoken by systemd-notify. The protocol for sending things directly to systemd's own journal also has this problem.

Both protocols attempt to find out information about the sending (client-end) process by reading files under /proc/client-process-id/. Unfortunately, systemd-notify is a short-lived program that exits as soon as it has sent the message to the server. If it has already exited by the time the server looks, reading /proc/client-process-id/* does not yield the information about the client end that the server needs. In particular, the server cannot determine what (systemd) control group the client end belongs to, and thus cannot determine what service unit controls it, and thus cannot determine whether it is a process that is allowed to send readiness notification messages.
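
To make the race concrete, here is a rough Java sketch of what such a lookup amounts to. It is not systemd's actual code; the class `CgroupLookup` and its methods are hypothetical names, and it assumes the usual `/proc/<pid>/cgroup` line format of `hierarchy-ID:controller-list:cgroup-path`. Once the client process has exited, the file no longer exists and the lookup comes back empty, which is the situation described above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.Optional;

/** Hypothetical illustration of mapping a PID to a service unit via /proc. */
final class CgroupLookup {

    private CgroupLookup() {}

    /** Extracts the unit name from one cgroup line if its path ends in a .service unit. */
    static Optional<String> unitOfLine(String cgroupLine) {
        // Assumed format: hierarchy-ID:controller-list:cgroup-path
        String[] fields = cgroupLine.split(":", 3);
        if (fields.length < 3) {
            return Optional.empty();
        }
        String path = fields[2].trim();
        String last = path.substring(path.lastIndexOf('/') + 1);
        return last.endsWith(".service") ? Optional.of(last) : Optional.empty();
    }

    /** Looks up the unit of a PID; empty if the process is gone or is not in a service. */
    static Optional<String> unitOfPid(long pid) {
        try {
            List<String> lines =
                Files.readAllLines(Paths.get("/proc", Long.toString(pid), "cgroup"));
            for (String line : lines) {
                Optional<String> unit = unitOfLine(line);
                if (unit.isPresent()) {
                    return unit;
                }
            }
            return Optional.empty();
        } catch (IOException e) {
            // The process has already exited (or /proc is unavailable): nothing to read.
            return Optional.empty();
        }
    }
}
```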

As you have discovered, calling a library routine in-process in your actual dæmon, instead of forking a short-lived child process to run systemd-notify, avoids this problem, because of course your dæmon does not immediately exit after sending the notification. Be aware, however, that if you issue a readiness notification immediately before exiting your dæmon (as, ironically, some dæmons do in order to notify the world that they are terminating), you'll encounter this same problem even with an in-process library function.

There's no need to call a systemd library function as native code in order to speak this protocol, by the way. (And not using the library function gains you the advantage of speaking this protocol properly even if systemd isn't at the server end of it — a failing of the systemd library function.) It's not a hard protocol to speak in Java, and the systemd manual page describes the protocol. You look at an environment variable, open a datagram socket, use the variable's value for the name of the socket to send to, send a single datagram message, and then close the socket. Java is capable of this. ☺
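
As a partial sketch of those steps in plain Java, the following covers reading the socket name and building the message. It follows the description in sd_notify(3): a `NOTIFY_SOCKET` value starting with `/` is a filesystem socket path, one starting with `@` is a Linux abstract-namespace socket (the `@` stands for a leading NUL byte), and the payload is a newline-separated list of `VARIABLE=VALUE` assignments. The class `NotifyProtocol` and its method names are hypothetical. The send itself is deliberately left out: it needs an `AF_UNIX` datagram socket, which, as the comments under this answer note, the standard Java socket API did not offer at the time, so that step is assumed to go through JNA, JNI, or a third-party library.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper covering the pure-Java parts of the notification protocol. */
final class NotifyProtocol {

    private NotifyProtocol() {}

    /** Converts a NOTIFY_SOCKET value into the bytes to use as the socket address path. */
    static byte[] socketPath(String notifySocket) {
        if (notifySocket == null || notifySocket.length() < 2) {
            throw new IllegalArgumentException("NOTIFY_SOCKET is unset or too short");
        }
        char first = notifySocket.charAt(0);
        if (first != '/' && first != '@') {
            throw new IllegalArgumentException("unsupported NOTIFY_SOCKET: " + notifySocket);
        }
        byte[] path = notifySocket.getBytes(StandardCharsets.UTF_8);
        if (first == '@') {
            path[0] = 0; // abstract namespace: '@' is replaced by a leading NUL byte
        }
        return path;
    }

    /** Builds the datagram payload from assignments such as "WATCHDOG=1". */
    static byte[] payload(String... assignments) {
        return String.join("\n", assignments).getBytes(StandardCharsets.UTF_8);
    }
}
```

A caller would pass `System.getenv("NOTIFY_SOCKET")` to `socketPath`, send `payload("WATCHDOG=1")` as a single datagram to that address, and then close the socket.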

JdeBP
  • 1
    _You look at an environment variable, open a datagram socket, use the variable's value for the name of the socket to send to, send a single datagram message, and then close the socket. Java is capable of this._ Sounds like a good open-source library to write! – Kyle Fransham May 03 '16 at 16:38
  • 4
    https://github.com/faljse/SDNotify, I quote: "The Notify protocol uses datagram unix sockets, which are not accessible via Java; Therefore SDNotify includes a JNA wrapper of the socket API." So Java is not capable of this alone. :'( – zenbeni Apr 11 '17 at 15:06
  • 1
    @zenbeni https://bugs.openjdk.org/browse/JDK-8297837 – Eng.Fouad Dec 01 '22 at 23:09