I am trying to handle a SIGFPE signal but my program just crashes or runs forever. I HAVE to use signal() and not the other ones like sigaction().

So in my code I have:

#include <stdio.h>
#include <signal.h>

void handler(int signum)
{
    // Do stuff here then return to execution below
}

int main()
{
    signal(SIGFPE, handler);

    int i, j;
    for(i = 0; i < 10; i++) 
    {
        // Call signal handler for SIGFPE
        j = i / 0;
    }

    printf("After for loop");

    return 0;
}

Basically, I want to go into the handler every time there is a division by 0. It should do whatever it needs to inside the handler() function then continue the next iteration of the loop.

This should also work for other signals that need to be handled. Any help would be appreciated.

syy

2 Answers

5

If you have to handle SIGFPE, or any other signal that your own code triggers synchronously by executing a faulting CPU instruction, the behavior is only defined if you either exit the program from the signal handler or use longjmp to get out. Returning normally from the handler is undefined; on x86 it typically re-executes the faulting instruction, which is why the program appears to run forever.

In the code below, note the exact placement of the calls that restore the previous handler: at the end of the computation branch, but at the start of the handling branch.

Unfortunately, you can't use signal() like this at all; the second SIGFPE brings the program down. You must use sigaction if you intend to handle the signal more than once.

#include <stdio.h>
#include <signal.h>
#include <setjmp.h>
#include <string.h>

jmp_buf fpe;

void handler(int signum)
{
    // Do stuff here then return to execution below
    longjmp(fpe, 1);
}

int main()
{
    volatile int i, j;
    for(i = 0; i < 10; i++) 
    {
        // Call signal handler for SIGFPE
        struct sigaction act;
        struct sigaction oldact;
        memset(&act, 0, sizeof(act));
        act.sa_handler = handler;
        /* SA_NOMASK is an obsolete Linux synonym for SA_NODEFER */
        act.sa_flags = SA_NODEFER;
        sigaction(SIGFPE, &act, &oldact);

        if (0 == setjmp(fpe))
        {
            j = i / 0;
            sigaction(SIGFPE, &oldact, &act);
        } else {
            sigaction(SIGFPE, &oldact, &act);
            /* handle SIGFPE */
        }
    }

    printf("After for loop\n");

    return 0;
}
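
If signal() really is a hard requirement, one possible workaround (a sketch, not part of the answer above) is to leave the handler with siglongjmp instead of longjmp. It assumes a POSIX system and assumes the second failure comes from SIGFPE still being blocked after a plain longjmp; sigsetjmp with a nonzero second argument saves the signal mask and siglongjmp restores it. The names fpe_env, fpe_handler and count_recoveries are made up for illustration, and raise(SIGFPE) stands in for the division, because integer division by zero is undefined behavior in C and not every CPU traps on it.

```c
#define _POSIX_C_SOURCE 200809L
#include <setjmp.h>
#include <signal.h>

/* Jump target shared by the handler and the loop (hypothetical name). */
static sigjmp_buf fpe_env;

static void fpe_handler(int signum)
{
    (void)signum;
    /* Leave the handler without returning. siglongjmp also restores
       the signal mask that sigsetjmp(fpe_env, 1) saved. */
    siglongjmp(fpe_env, 1);
}

/* Runs `iterations` loop passes that each raise SIGFPE and returns
   how many passes came back through the handler. */
int count_recoveries(int iterations)
{
    /* volatile: these are modified between sigsetjmp and siglongjmp */
    volatile int i;
    volatile int recovered = 0;

    for (i = 0; i < iterations; i++) {
        /* Reinstall on every pass: some systems reset the
           disposition to SIG_DFL when the signal is delivered. */
        signal(SIGFPE, fpe_handler);

        if (sigsetjmp(fpe_env, 1) == 0) {
            /* Stand-in for the faulting division. */
            raise(SIGFPE);
        } else {
            /* Control arrives here from the handler. */
            recovered++;
        }
    }

    signal(SIGFPE, SIG_DFL);
    return recovered;
}
```

Calling count_recoveries(10) from main and printing the result afterwards would show whether every pass reached the handler and the loop ran to completion.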
Joshua
  • Hi, I get `error: unknown type name ‘sighandler_t’` when I run this. I use `gcc` to compile. – syy Sep 11 '16 at 01:13
  • @Flow; Check your compiler manual for signal; it seems to be nonstandard. – Joshua Sep 11 '16 at 01:41
  • I'm using `gcc (Ubuntu 4.8.4-2ubuntu1~14.04.3) 4.8.4` inside Ubuntu 14.04 server version. When I do `man signal`, I see `sighandler_t` in there. Not sure what else to do. – syy Sep 11 '16 at 01:44
  • OK I figured it out; looks like you're supposed to add the definition of sighandler_t yourself. – Joshua Sep 11 '16 at 01:49
  • Yep thanks! I had to add the definition myself. But now it never prints `After for loop`. I get `Floating point exception (core dumped)` printed out once then it exits. – syy Sep 11 '16 at 01:53
  • @CraigEstey: Possibly that simplifies matters, or possibly not. Signal masking is irrelevant here. – Joshua Sep 11 '16 at 03:42
  • Under POSIX rules (which apply when you're using [`sigaction()`](http://pubs.opengroup.org/onlinepubs/9699919799/functions/sigaction.html)) you have to explicitly use `SA_RESETHAND` to get the handler reset to `SIG_DFL` — so the reuse of `sigaction()` is not necessary. Even if it was, you're explicitly allowed to call `signal()` and `sigaction()` in a signal handler — see POSIX's [Signal Concepts](http://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_04). – Jonathan Leffler Sep 11 '16 at 03:43
  • Kinda figured, but wasn't sure. I've always used the `sig*` versions when I've implemented such a signal handler [for doing a stack traceback and terminate]. – Craig Estey Sep 11 '16 at 03:48
  • @JonathanLeffler: I'm not restoring SIG_DFL per se, I'm restoring whatever the calling function had already set up. – Joshua Sep 11 '16 at 04:10
  • `j = i / 0; sigaction(SIGFPE, &oldact, &act);` Why do you need it in the if branch as well? After i/0 the signal should be generated, go to the handler, and return to the beginning of the if condition. – sunil Aug 08 '18 at 05:41
  • My question is not about the "if". It's about why sigaction() is called inside the if branch. j=i/0 itself will cause the exception, right? So the next statement is not executed. – sunil Aug 09 '18 at 17:31
  • @sunil: Ah. In the question, whether or not the divisor was 0 depended on data input, so whether or not there would be a division by zero depended on input. Should the code flow through the end of the if branch, the signal handler needs to go away. – Joshua Aug 09 '18 at 17:36
5

Caveat: Sorry to rain on the parade, but you really don't want to do this.

It is perfectly valid to trap [externally generated] signals like SIGINT, SIGTERM, SIGHUP etc. to allow graceful cleanup and termination of a program that may have files open that are partially written to.
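
As a rough sketch of that kind of graceful shutdown: the handler only sets a flag, and the normal code path notices it between units of work and cleans up. The names stop_requested, on_terminate and run_until_signalled are hypothetical, and raise(SIGTERM) stands in for a signal that would really come from outside the process.

```c
#include <signal.h>

/* Set by the handler, polled by the work loop. */
static volatile sig_atomic_t stop_requested = 0;

static void on_terminate(int signum)
{
    (void)signum;
    /* Only record the request; cleanup happens in normal code. */
    stop_requested = 1;
}

/* Does up to max_units units of work, stopping early once a
   termination signal has been seen. Returns the units completed.
   signal_at simulates an external signal after that many units
   (0 means never). */
int run_until_signalled(int max_units, int signal_at)
{
    int done = 0;

    stop_requested = 0;
    signal(SIGINT, on_terminate);
    signal(SIGTERM, on_terminate);

    while (done < max_units && !stop_requested) {
        /* ... one complete unit of work, e.g. write a whole record ... */
        done++;

        if (done == signal_at)
            raise(SIGTERM);   /* stand-in for an external kill */
    }

    /* ... flush and close open files here ... */

    signal(SIGINT, SIG_DFL);
    signal(SIGTERM, SIG_DFL);
    return done;
}
```

Because the loop only checks the flag between units, a file being written is never abandoned halfway through a record.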

However, internally generated signals, such as SIGILL, SIGBUS, SIGSEGV and SIGFPE, are very hard to recover from meaningfully. The first three are bugs, pure and simple. And, IMO, SIGFPE is a hard bug as well.

After such a signal, your program is in an unsafe and indeterminate state. Even trapping the signal and doing longjmp/siglongjmp doesn't fix this.

And, there is no way to tell exactly how bad the damage is. Or, how bad the damage will become if the program tries to proceed.

If you get SIGFPE, was it for a floating point calculation [which you might be able to smooth over]? Or was it for an integer divide-by-zero? What calculation was being done? And where? You don't know.
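
For the floating point side, a small sketch may help, assuming IEEE 754 arithmetic with the default (non-trapping) floating point environment: dividing a nonzero value by zero then yields an infinity rather than raising SIGFPE, so it can be checked after the fact. The function name fp_div_is_infinite is made up for this example.

```c
#include <math.h>

/* Returns 1 if num / den produced an infinity, 0 otherwise.
   Assumes IEEE 754 doubles with floating point traps disabled,
   which is the usual default. */
int fp_div_is_infinite(double num, double den)
{
    /* volatile keeps the compiler from folding the division away */
    volatile double d = den;
    double q = num / d;

    return isinf(q) ? 1 : 0;
}
```

An integer division has no such escape hatch, which is one reason the two cases are so different to deal with.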

Trying to continue can sometimes cause 10x the damage because now the program is out of control. After recovery, the program may be okay, but it may not be. So the reliability of the program after the event cannot be determined with any degree of certainty.

What were the events (i.e. calculations) that led up to the SIGFPE? Maybe it's not merely a single divide, but the chain of calculations that led up to the value being zero. Where did these values go? Will these now suspect values be used by code after the recovery operation has taken place?

For example, the program might overwrite the wrong file because the failed calculation was somehow involved in selecting the file descriptor that a caller is going to use.

Or, you leak memory. Or, corrupt the heap. Or, was the error within the heap allocation code itself?

Consider the following function:

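#include <fcntl.h>
#include <unistd.h>

// illustrative globals, assumed to be defined elsewhere
extern char buf[];
extern size_t len;
extern int x, y, z;

void
myfunc(char *file)
{
    int fd;

    fd = open(file,O_WRONLY);

    while (1) {
        // do stuff ...

        // write to the file
        write(fd,buf,len);

        // do more stuff ...

        // generate SIGFPE when z is 0 ...
        x = y / z;
    }

    // not reached if a signal handler jumps out of the loop
    close(fd);
}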

Even with a signal handler that does siglongjmp, the file that myfunc was writing to is now corrupted/truncated. And, the file descriptor won't be closed.

Or, what if myfunc was reading from the file and saving the data to some array? That array is only partially filled. Now, you get SIGFPE. This is intercepted by the signal handler, which does siglongjmp.

One of the callers of myfunc does the sigsetjmp to "catch" this. But, what can it do? The caller has no idea how bad things are. It might assume that the buffer myfunc was reading into is fully formed and write it out to a different file. That other file has now become corrupted.
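
For integer division, the signal can be avoided entirely by checking the operands first, so the caller gets an ordinary error to handle while its state is still known. This is only a sketch; checked_div is a hypothetical helper. It also rejects INT_MIN / -1, whose quotient does not fit in an int and which traps on x86 as well.

```c
#include <limits.h>
#include <stdbool.h>

/* Divides num by den and stores the quotient in *out.
   Returns false, leaving *out untouched, when the division
   would be undefined. */
bool checked_div(int num, int den, int *out)
{
    if (den == 0)
        return false;               /* division by zero */

    if (num == INT_MIN && den == -1)
        return false;               /* quotient overflows int */

    *out = num / den;
    return true;
}
```

A caller such as myfunc could then close its file descriptor and return an error code on failure, instead of being torn out of the loop by a signal.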


UPDATE:

Oops, forgot to mention undefined behavior ...

Normally, we associate UB, such as writing past the end of an array, with a segfault [SIGSEGV]. But, what if it causes SIGFPE instead?

It's no longer just a "bad calculation" -- we're trapping [and ignoring] UB at the earliest detection point. If we do recovery, the next usage could be worse.

Here's an example:

// assume these are ordered in memory as if they were part of the same struct:
int x[10];
int y;
int z;

void
myfunc(void)
{

    // initialize
    y = 23;
    z = 37;

    // do stuff ...

    // generate UB -- we run one past the end of x and zero out y
    for (int i = 0;  i <= 10;  ++i)
        x[i] = 0;

    // do more stuff ...

    // generate SIGFPE ...
    z /= y;

    // do stuff ...

    // do something _really_ bad with y that causes a segfault or _worse_
    // sends a space rocket off-course ...
}
Craig Estey
  • Very true; trying to recover from SIGILL, SIGBUS, SIGFPE, SIGSEGV is almost invariably fraught, and unreliable, and likely to lead to problems. SIGFPE, these days, is almost only raised for integer division by zero; floating point division by zero normally returns an infinity. – Jonathan Leffler Sep 11 '16 at 03:47
  • @JonathanLeffler Ironically, I've had to implement such a handler for a mission critical application. If a thread trapped something (e.g. `SIGSEGV`), it would signal a master thread. The master would signal all others. They would all do stack tracebacks to files and hold [being careful to assume nothing about program state]. When all dumping completed, the master would terminate the app. A wrapper script/program would restart the program after sending the dump files to a server. But, they would _never_ try to "recover". – Craig Estey Sep 11 '16 at 04:00
  • I've seen similar code in a multi-threaded database server where the threading used multiple processes (so memory management was a perpetual problem — anything related to a thread had to be in shared memory), and where faults would terminate the process (thread) that encountered the problem after dumping core, then start a replacement process. That's pretty similar to what you describe, except that the threading was novel — at least, it was novel in 1994 before POSIX threads were widely available. These days, it mostly feels like a nuisance and an inhibitor because of the memory management. – Jonathan Leffler Sep 11 '16 at 04:04
  • I too would be writing this answer if there was even a *hint* he was trying to recover from something complex. – Joshua Sep 11 '16 at 04:11
  • @Joshua You were being kind to OP and his immediate goal. If you hadn't posted your answer [which I read before posting mine], I would probably have added the exact procedure to my answer. But, since you had already done a good job with that, I didn't want to duplicate your effort. Since I have a lot of practical experience with such recovery [and why it isn't a good idea], I decided to "take the low road" :-). To me, the "hint" was when OP said _"for other signals that need to be handled"_. I was afraid this meant semi-fatal signals like `SIGSEGV`. – Craig Estey Sep 11 '16 at 04:30
  • @JonathanLeffler Was it based on `GNU pth` https://www.gnu.org/software/pth/ ? The earliest changelog entry I could find was 1999, so it may not be the same. In my answer: http://stackoverflow.com/questions/39185134/how-are-user-level-threads-scheduled-created-and-how-are-kernel-level-threads-c/39185831#39185831 what I was calling `LWP` was actually GNU pth [I had been using it before NPTL, but couldn't remember the name] – Craig Estey Sep 11 '16 at 04:41
  • No; proprietary technology. – Jonathan Leffler Sep 11 '16 at 04:42
  • Downvoted for not actually answering the question. Warnings and caveats are fine but sometimes you really do want something that usually is undesirable. – Joseph Garvin Sep 12 '17 at 16:55
  • @JosephGarvin Your downvote doesn't really meet the guidelines for downvoting an answer (i.e. _egregiously_ wrong). And, I've posted a number of answers that are "don't do that" [as have many other responders] and they don't get downvoted. In most cases, OPs are grateful for those answers. Sometimes it _is_ okay to do something that is not normally advisable [I do it all the time if needed], but herein is _not_ one of them. And you _can't_ recover because you don't know which value you need to adjust to prevent further damage. – Craig Estey Sep 13 '17 at 01:27
  • @CraigEstey If the guidelines fail to say that an answer should actually answer the question, the guidelines are broken. The constant tendency of posters on stackoverflow to say that the questioner should not have the need they have is one of the most irritating aspects about it. If you have a minority use case you get a ton of posts like yours telling you you shouldn't need what you need, usually from people with no insight into your requirements. – Joseph Garvin Sep 13 '17 at 12:36
  • @JosephGarvin It is a matter of opinion as to whether a given answer answers the question. But, here, my answer _did_ answer the question because what OP is trying to do is _impossible_ in any meaningful way. I do hope you read the entire answer to see why. And, it's backed up by the fact that I've written code that had to field such signals for three mission critical [must _not_ fail] commercial products. – Craig Estey Sep 14 '17 at 00:42