
Summary

I have worked out a solution to the issue described in this question.

Basically, the callee (wallpaper) was not itself exiting because it was waiting on another process to finish.

Over the course of 52 days, this problematic side effect had snowballed until 10,000+ lingering processes were consuming 10+ gigabytes of RAM, almost crashing my system.

The offending process turned out to be a call to printf, from a function called log, that I had sent into the background and forgotten about; it was hanging because it was writing to a pipe with no reader.

As it turns out, a process writing to a named pipe will block until another process comes along and reads from it.

This, in turn, changed the requirements of the question from "I need a way to stop these processes from building up" to "I need a better way of getting around FIFO I/O than throwing it to the background".


Note that while the question has been solved, I'm more than happy to accept an answer that goes into technical detail, for example, on the unsolved mystery of why the caller script's (wallpaper-run) process was being duplicated as well, even though it was only called once, or on how to read a pipe's state information properly, rather than relying on open failing when called with O_NONBLOCK.

The original question follows.


The Question

I have two bash scripts meant to run in a loop. The first, wallpaper-run, runs in an infinite loop and calls the second, wallpaper.

They are part of my "desktop", which is a bunch of hacked together shell scripts augmenting the dwm window manager.

wallpaper-run:

log "starting wallpaper runner"

while true; do
    log "..."
    $scr/wallpaper
    sleep 900 # 15 minutes
done &

wallpaper:

log "changing wallpaper"

# several utility functions ...

if [[ $1 ]]; then
    parse_arg $1
else
    load_random
fi

Some notes:

  • log is an exported function from init, which, as its name suggests, logs a message.

  • init calls wallpaper-run (among other things) in its foreground (hence the while loop being in the background)

  • $scr is also defined by init; it is the directory where so-called "init-scripts" go

  • parse_arg and load_random are local to wallpaper

  • in particular, images are loaded into the background via the program feh

  • The manner in which wallpaper-run is loaded is as follows: $mod/wallpaper-run

  • init is called directly by startx, and starts dwm before it runs wallpaper-run (and the other "modules")

Now on to the problem, which is that for some reason, both wallpaper-run and wallpaper "linger" in memory. That is to say, after each iteration of the loop, two new instances of wallpaper and wallpaper-run are created, while the "old" ones don't get cleaned up and get stuck in sleep status. It's like a memory leak, but with lingering processes instead of bad memory management.

I found out about this "process leak" after having my system up for 52 days, when everything broke (something like bash: cannot fork: resource temporarily unavailable was spammed to the terminal whenever I tried to run a command) because the system ran out of memory. I had to kill over 10,000 instances of wallpaper/run to bring my system back to working order.

I have absolutely no idea why this is the case. I see no reason for these scripts to linger in memory because a script exiting should mean that its process gets cleaned up.

Why are they lingering and eating up resources?


Update 1

With some help from the comments (much thanks to I'L'I), I've traced the problem to the function log, which makes background calls to printf (though why I chose to do that, I don't recall). Here is the function as it appears in init:

log(){
    local pipe=$pipe_front
    if ! [[ -p $pipe ]]; then
        mkfifo $pipe
    fi
    printf ... >> $initlog
    printf ... > $pipe &
    printf ... &
    [[ $2 == "-g" ]] &&  notify-send "[DWM Init] $1"
    sleep 0.001
}

As you can see, the function is very poorly written. I hacked it together to make it work, not to make it robust.

The second and third printf calls are sent to the background. I don't recall why I did this, but presumably the first printf was making log hang.

The printf lines have been abridged to "...", because they are fairly complex and not relevant to the issue at hand (and because I have better things to do with 40 minutes of my time than fight with Android's garbage text input interface). In particular, things like the current time, the name of the calling process, and the passed message are printed, depending on which printf we're talking about. The first has the most detail, because it's saved to a file where immediate context is lost, while the notify-send line has the least detail, because it's displayed on the desktop.

The whole pipe debacle is for interfacing directly with init via a rudimentary shell that I wrote for it.

The third printf is intentional; it prints to the tty that I log into at the beginning of a session. This is so that if init suddenly crashes on me, I can see a log of what went wrong, or at least of what was happening before it crashed.

I'm including this in the question because this is the root cause of the "leak". If I can fix this function, the issue will be resolved.

The function needs to log the messages to their respective destinations and halt until each call to printf finishes, but it also must finish in a timely manner; hanging for an indefinite period of time and/or failing to log the messages is unacceptable behavior.


Update 2

After isolating the log function (see update 1) into a test script and setting up a mock environment, I've boiled it down to printf.

The printf call which is redirected into a pipe,

printf "..." > $pipe

hangs if nothing is listening to it, because it's waiting for a second process to pick up the read end of the pipe and consume the data. This is probably why I had initially forced them into the background: so that a process could eventually read the data from the pipe while, in the meantime, the system could move on and do other things.

The call to sleep, then, was a not-well-thought-out hack to work around data race problems resulting from one reader trying to read from multiple writers simultaneously. The theory was that if each writer had to wait 0.001 seconds (despite the fact that the backgrounded printf has nothing to do with the sleep following it), the data would somehow appear in order and the bug would be fixed. Of course, looking back, that does nothing useful.

The end result is several background processes hanging on to the pipe, waiting for something to read from it.

The answer to "Prevent hanging of "echo STRING > fifo" when nothing..." presents the same "solution" that caused the bug that spawned this question, so it's obviously incorrect here. However, an interesting comment by user R.. mentions that a fifo's state includes information such as whether it has readers:

Storing state? You mean the absence/presence of a reader? That's part of the state of the fifo; any attempt to store it outside would be bogus and would be subject to race conditions.

Obtaining this information and refusing to write if there is no reader is the key to solving this.

However, no matter what I search for on Google, I can't seem to find anything about reading the state of a pipe, even in C. I am perfectly willing to use C if need be, but a bash solution (or an existing core util) would be preferred.

So now the question becomes: how in the heck do I read the state information of a FIFO, particularly which process(es) have the pipe open for reading and/or writing?

  • NB: I typed this from my phone, I really hope there aren't any autocorrect-induced spelling mistakes – Braden Best Jan 23 '17 at 18:58
  • You say that `log` calls `wallpaper-run`. Is that the case? – that other guy Jan 23 '17 at 19:04
  • @that I just noticed that ambiguity. Fixing it now – Braden Best Jan 23 '17 at 19:08
  • You say "log is an exported function from init, which calls wallpaper-run (among other things) in its foreground". Do you mean "log" calls your script, and your script calls "log"? – Fred Jan 23 '17 at 19:08
  • It sounds like `wallpaper-run` is being executed over and over. Is that the case? Can you tell whether this is the case from the log? If so, why is that happening? – that other guy Jan 23 '17 at 19:12
  • @Fred (and that other guy) fixed. Thanks for pointing it out. – Braden Best Jan 23 '17 at 19:12
  • @thatotherguy I don't believe it is. But I could be wrong if there are semantics I'm unaware of. wallpaper-run is called once and only once from init, which is called by startx as an impromptu xinitrc. wallpaper-run in turn calls wallpaper in an infinite loop. The loop is put in the background so init can call other "runners", and they aren't themselves backgrounded because, iirc, doing that caused some other desktop-breaking bug. – Braden Best Jan 23 '17 at 19:16
  • Re: "because a script exiting should mean that its process gets cleaned up." Not really, if that process has forked or spawned child processes, they'll still be there doing whatever it is they are doing (eg. x 10,000). – l'L'l Jan 23 '17 at 19:29
  • @l'L'l good to know. wallpaper-run spawns wallpaper, and wallpaper spawns feh. Both of which do the calling in the foreground (I.e. without `&`) and wait (implicitly) for said process to finish before continuing. The very fact that wallpaper-run manages to call wallpaper more than once tells me that the child processes are apparently exiting. – Braden Best Jan 23 '17 at 19:35
  • The behavior you describe has the symptoms of a race-condition, so I might suggest using `wait`, or checking for any child processes running. See my answer here: http://stackoverflow.com/questions/36364505/bash-cron-flock-screen/36366663#36366663 – l'L'l Jan 23 '17 at 19:42
  • @l'L'l Using wait in wallpaper ended up freezing the whole loop, so I tried commenting out all the log lines, considering that it makes background calls to printf (apparently printf is hanging, which must have been the reason I backgrounded it), and now the wallpaper/run issue is worked-around. I still don't understand why it duplicated the wallpaper-run process, but now i have to work out how to fix log. I never suspected log because it is a function, not a process. – Braden Best Jan 23 '17 at 20:35
  • @BradenBest are the "lingering processes" Zombie processes? e.g. are they marked with `Z` when doing `ps aux`? – hansaplast Jan 23 '17 at 20:41
  • @hansaplast they are marked with 'S' – Braden Best Jan 23 '17 at 21:09
  • @BradenBest would it be possible to bring your setup down to 1-2 scripts you could fully post above? It's hard to help without the full information (e.g. `parse_arg` and `load_random` and also the way `init` calls wallpaper-run is not clear to me) – hansaplast Jan 24 '17 at 05:23
  • @hansaplast under normal circumstances, I would include every bit of relevant information that I see fit. But these are not normal circumstances. Let's just say that this is the first and last time that I'll type program code on a cell phone. Ugh. That said, please do read the updated question. I've included more information about the root cause of the problem: log. – Braden Best Jan 24 '17 at 07:20
  • 1
    Regarding Update 2: I'm not sure if I should change the name of the question, make a new one and link to it from the question, or leave the question as-is. My judgement tells me to leave it as-is since the answer that can address the question in update 2 would be the one that effectively solves the issue the question asks about. Apologies if that goes against what a more frequent user would do. – Braden Best Jan 24 '17 at 09:02
  • @BradenBest this question was an interesting ride! I came here a couple of times to see how things develop. As you solved the issue IMO it would be best to add a "tl;dr" section at the start of the question for future readers and probably also change the title of the question? – hansaplast Jan 25 '17 at 09:41
  • 1
    @hansaplast I headed the question with a short summary and added the root cause of the issue to the question title – Braden Best Jan 25 '17 at 19:18

2 Answers


https://stackoverflow.com/a/20694422

The answer linked above shows a C program attempting to open a file with O_NONBLOCK. So I tried writing a program whose job is to return 0 (success) if open returns a valid file descriptor, and 1 (failure) if open returns -1.

#include <fcntl.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
    /* With O_NONBLOCK, opening a FIFO for writing fails immediately
       (with errno == ENXIO) if no process has it open for reading. */
    int fd = open(argv[1], O_WRONLY | O_NONBLOCK);

    if(fd == -1)
        return 1;

    close(fd);
    return 0;
}

I didn't bother checking whether argv[1] is null, or whether open failed because the file doesn't exist, since I only plan to use this program from a shell script where it is guaranteed to be given the correct arguments.

That said, the program does its job:

$ gcc pipe-open.c
$ ./a.out ./pipe && echo "pipe has a reader" || echo "pipe has no reader"
$ ./a.out ./pipe && echo "pipe has a reader" || echo "pipe has no reader"

Assuming the existence of pipe and that between the first and second invocations, another process opens the pipe (cat pipe), the output looks like this:

pipe has no reader

pipe has a reader

The program also works if the pipe has a second writer (i.e. it will still fail, because there is no reader).

The only problem is that after closing the file, the reader closes its end of the pipe as well. And removing the call to close won't do any good, because all open file descriptors are automatically closed when the process terminates (control passes from main to exit, and any descriptors still open are closed one by one). Not good!

This means that the only window to actually write to the pipe is before it is closed, i.e. from within the C program itself.

#include <fcntl.h>
#include <unistd.h>

/* Drain stdin into the pipe; report whether anything was written. */
int
write_to_pipe(int fd)
{
    char buf[1024];
    ssize_t nread;
    int nsuccess = 0;

    while((nread = read(0, buf, sizeof buf)) > 0 && ++nsuccess)
        write(fd, buf, nread);

    close(fd);
    return nsuccess > 0 ? 0 : 2;
}

int
main(int argc, char **argv)
{
    /* O_NONBLOCK makes open fail with ENXIO instead of blocking
       when the FIFO has no reader. */
    int fd = open(argv[1], O_WRONLY | O_NONBLOCK);

    if(fd == -1)
        return 1;

    return write_to_pipe(fd);
}

Invocation:

$ echo hello world | ./a.out pipe
$ ret=$?
$ if [[ $ret == 1 ]]; then echo no reader
> elif [[ $ret == 2 ]]; then echo an error occurred trying to write to the pipe
> else echo success
> fi

Output with same conditions as before (1st call has no reader; 2nd call does):

no reader

success

Additionally, the text "hello world" can be seen in the terminal reading the pipe.

And finally, the problem is solved. I have a program which acts as a middleman between a writer and a pipe: it exits immediately with a failure code if no reader is attached to the pipe at the time of invocation; otherwise, it attempts to write to the pipe, and communicates failure if nothing is written.

That last part is new. I thought it might be useful in the future to know if nothing got written.

I'll probably add more error detection in the future, but since log checks for the existence of the pipe before trying to write to it, this is fine for now.


The issue is that you are starting the wallpaper process without checking if the previous run finished or not. So, in 52 days, potentially 4 * 24 * 52 = ~5000 instances could be running (not sure how you found 10000, though)! Is it possible to use flock to make sure there is only one instance of wallpaper running at a time?

See this post: Quick-and-dirty way to ensure only one instance of a shell script is running at a time

  • The reason it produced 10,000 was that both wallpaper and wallpaper-run were being duplicated. I still have no idea why the latter was duplicated, but I've traced the problem to the log function, which I will be adding to the question in the next edit. – Braden Best Jan 23 '17 at 20:38
  • NB: I've made two big updates to the question since the time of your answer. – Braden Best Jan 24 '17 at 16:55