11

There's a conditional debugging flag I miss from Matlab: dbstop if infnan described here. If set, this condition will stop code execution when an Inf or NaN is encountered (IIRC, Matlab doesn't have NAs).

How might I achieve this in R in a more efficient manner than testing all objects after every assignment operation?

At the moment, the only ways I see to do this are via hacks like the following:

  1. Manually insert a test after all places where these values might be encountered (e.g. a division, where division by 0 may occur). The testing would be to use is.finite(), described in this Q & A, on every element.
  2. Use body() to modify the code to call a separate function, after each operation or possibly just each assignment, which tests all of the objects (and possibly all objects in all environments).
  3. Modify R's source code (?!?)
  4. Attempt to use tracemem to identify those variables that have changed, and check only these for bad values.
  5. (New - see note 2) Use some kind of call handlers / callbacks to invoke a test function.

The 1st option is what I am doing at present. This is tedious, because I can't guarantee I've checked everything. The 2nd option will test everything, even if an object hasn't been updated. That is a massive waste of time. The 3rd option would involve modifying assignments of NA, NaN, and infinite values (+/- Inf), so that an error is produced. That seems like it's better left to R Core. The 4th option is like the 2nd - I'd need a call to a separate function listing all of the memory locations, just to ID those that have changed, and then check the values; I'm not even sure this will work for all objects, as a program may do an in-place modification, which seems like it would not invoke the duplicate function.

Is there a better approach that I'm missing? Maybe some clever tool by Mark Bravington, Luke Tierney, or something relatively basic - something akin to an options() parameter or a flag when compiling R?

Example code Here is some very simple example code to test with, incorporating the addTaskCallback function proposed by Josh O'Brien. The code isn't interrupted, but an error does occur in the first scenario, while no error occurs in the second case (i.e. badDiv(0,0,FALSE) doesn't abort). I'm still investigating callbacks, as this looks promising.

badDiv  <- function(x, y, flag){
    z = x / y
    if(flag == TRUE){
        return(z)
    } else {
        return(FALSE)
    }
}

addTaskCallback(stopOnNaNs)
badDiv(0, 0, TRUE)

addTaskCallback(stopOnNaNs)
badDiv(0, 0, FALSE)

Note 1. I'd be satisfied with a solution for standard R operations, though a lot of my calculations involve objects used via data.table or bigmemory (i.e. disk-based memory mapped matrices). These appear to have somewhat different memory behaviors than standard matrix and data.frame operations.

Note 2. The callbacks idea seems a bit more promising, as this doesn't require me to write functions that mutate R code, e.g. via the body() idea.

Note 3. I don't know whether or not there is some simple way to test the presence of non-finite values, e.g. meta information about objects that indexes where NAs, Infs, etc. are stored in the object, or if these are stored in place. So far, I've tried Simon Urbanek's inspect package, and have not found a way to divine the presence of non-numeric values.

Follow-up: Simon Urbanek has pointed out in a comment that such information is not available as meta information for objects.

Note 4. I'm still testing the ideas presented. Also, as suggested by Simon, testing for the presence of non-finite values should be fastest in C/C++; that should surpass even compiled R code, but I'm open to anything. For large datasets, e.g. on the order of 10-50GB, this should be a substantial savings over copying the data. One may get further improvements via use of multiple cores, but that's a bit more advanced.

Community
  • 1
  • 1
Iterator
  • 20,250
  • 12
  • 75
  • 111
  • Some of the primitive functions have this functionality built in, ie, they return an error or a warning if supplied or deliver a non-finite result. Take, `sin(Inf)` for example. Perhaps that's something that you could explore. – Brandon Bertelsen Feb 06 '12 at 21:40
  • Well, it's not always the case that an Inf or NaN *should* halt your function/code(NA is a separate case because it's deliberately used all the time as a 'filler' or 'marker') . I often run some operations which produce some Inf values, say in low-signal regions of some matrix. I suspect you'll get better info by using `is.infinite` and/or `is.nan` on suspect variables anyway. – Carl Witthoft Feb 06 '12 at 22:49
  • @CarlWitthoft In the code + data scenario I'm currently working on, the problem values are precisely those three - NA, NaN, and Infs. In other cases, I definitely need NAs, but not today. :) I need the code to abort (it's rather computationally expensive, b/c of the data volume) as soon as these occur. Hence, the reason I'd really like to trigger an error (or at least a warning). – Iterator Feb 06 '12 at 22:59
  • @BrandonBertelsen Thanks for the suggestion. I took a look at the basic arithmetic operators, and nothing popped out from inside R. I suspect that this special handling is in the C code for certain primitive functions, but then there's always the possibility that it's one of those things known only to R wizards. :) – Iterator Feb 06 '12 at 23:01
  • @Iterator -- Thanks for the thought-provoking question. It gave me just the sort of mind-bending workout that keeps me coming back to SO. Please let me know if my suggestion below ends up helping with your particular situation! – Josh O'Brien Feb 06 '12 at 23:41

2 Answers2

7

I fear there is no such shortcut. In theory on unix there is SIGFPE that you could trap on, but in practice

  1. there is no standard way to enable FP operations to trap it (even C99 doesn't include a provision for that) - it is highly system-specifc (e.g. feenableexcept on Linux, fp_enable_all on AIX etc.) or requires the use of assembler for your target CPU
  2. FP operations are nowadays often done in vector units like SSE so you can't be even sure that FPU is involved and
  3. R intercepts some operations on things like NaNs, NAs and handles them separately so they won't make it to the FP code

That said, you could hack yourself an R that will catch some exceptions for your platform and CPU if you tried hard enough (disable SSE etc.). It is not something we would consider building into R, but for a special purpose it may be doable.

However, it would still not catch NaN/NA operations unless you change R internal code. In addition, you would have to check every single package you are using since they may be using FP operations in their C code and may also handle NA/NaN separately.

If you are only worried about things like division by zero or over/underflows, the above will work and is probably the closest to something like a solution.

Just checking your results may not be very reliable, because you don't know whether a result is based on some intermediate NaN calculation that changed an aggregated value which may not need to be NaN as well. If you are willing to discard such case, then you could simply walk recursively through your result objects or the workspace. That should not be extremely inefficient, because you only need to worry about REALSXP and not anything else (unless you don't like NAs either - then you'd have more work).


This is an example code that could be used to traverse R object recursively:

static int do_isFinite(SEXP x) {
    /* recurse into generic vectors (lists) */
    if (TYPEOF(x) == VECSXP) {
        int n = LENGTH(x);
        for (int i = 0; i < n; i++)
            if (!do_isFinite(VECTOR_ELT(x, i))) return 0;
    }
    /* recurse into pairlists */ 
    if (TYPEOF(x) == LISTSXP) {
         while (x != R_NilValue) {
             if (!do_isFinite(CAR(x))) return 0;
             x = CDR(x);
         }
         return 1;
    }
    /* I wouldn't bother with attributes except for S4
       where attributes are slots */
    if (IS_S4_OBJECT(x) && !do_isFinite(ATTRIB(x))) return 0;
    /* check reals */
    if (TYPEOF(x) == REALSXP) {
        int n = LENGTH(x);
        double *d = REAL(x);
        for (int i = 0; i < n; i++) if (!R_finite(d[i])) return 0;
    }
    return 1; 
}

SEXP isFinite(SEXP x) { return ScalarLogical(do_isFinite(x)); }

# in R: .Call("isFinite", x)
Simon Urbanek
  • 13,842
  • 45
  • 45
  • Darn it, I was waiting for the clouds to part, the angels to sing, and for you to post. Once I read "I fear ...", I thought "Oh yeah, wait until Simon shows up, this guy is so wrong...let's see who this is..." :) As for why I don't post on R-Devel - R-Core is scary. :) – Iterator Feb 06 '12 at 23:30
  • More seriously, though, I have been looking into callbacks, condition handlers, and your `inspect` package. 1: Is it the case that the internal structure of the object won't reveal whether or not there are Infs or NAs, for FPs? I.e. is there any meta-information about the presence/location of non-finite values? 2: If there is such info, could I use the call handlers to invoke a call for checking bad values after the execution of each statement? – Iterator Feb 06 '12 at 23:36
  • Do I look scary? ;) Sorry to disappoint you - IMHO `SIGFPE` is really the way to go (I suspect that's what Matlab uses) but the lack of standards is really frustrating (and Matlab doesn't need to special-case NAs). – Simon Urbanek Feb 07 '12 at 00:23
  • But, no, there is no way to tell that a vector contains Inf/NaN/NA other than looking at its contents. Part of the reason is that `REALSXP` is passed as a `double*` pointer to any `C` code so there are no accessor macros and hence no control over what values are written where (unlike `STRSXP` for example) - so you would have to check the whole vector each time to implement such bit. BTW `inspect` is now part of R, you call it as `.Internal(inspect(x))`. – Simon Urbanek Feb 07 '12 at 00:23
  • Regarding `SIGFPE` - you're probably right. Unfortunately, I couldn't get at Matlab's internals. ;-) I was hoping it was a .M file, so I could check, but oh well, opacity is one of many reasons I avoid Matlab. Regarding `inspect()` - Joshua Ulrich suggested the use of the separate package in [this answer](http://stackoverflow.com/a/7327426/805808). Does this mean the package is deprecated? – Iterator Feb 07 '12 at 00:43
  • I didn't realize people are still using it (I wrote it for myself ;)) - I can map the `inspect` function to the internal one for recent R versions... – Simon Urbanek Feb 07 '12 at 01:24
7

The idea sketched below (and its implementation) is very imperfect. I'm hesitant to even suggest it, but: (a) I think it's kind of interesting, even in all of its ugliness; and (b) I can think of situations where it would be useful. Given that it sounds like you are right now manually inserting a check after each computation, I'm hopeful that your situation is one of those.

Mine is a two-step hack. First, I define a function nanDetector() which is designed to detect NaNs in several of the object types that might be returned by your calculations. Then, it using addTaskCallback() to call the function nanDetector() on .Last.value after each top-level task/calculation is completed. When it finds an NaN in one of those returned values, it throws an error, which you can use to avoid any further computations.

Among its shortcomings:

  • If you do something like setting stop(error = recover), it's hard to tell where the error was triggered, since the error is always thrown from inside of stopOnNaNs().

  • When it throws an error, stopOnNaNs() is terminated before it can return TRUE. As a consequence, it is removed from the task list, and you'll need to reset with addTaskCallback(stopOnNaNs) it you want to use it again. (See the 'Arguments' section of ?addTaskCallback for more details).

Without further ado, here it is:


# Sketch of a function that tests for NaNs in several types of objects
nanDetector <- function(X) {
   # To examine data frames
   if(is.data.frame(X)) { 
       return(any(unlist(sapply(X, is.nan))))
   }
   # To examine vectors, matrices, or arrays
   if(is.numeric(X)) {
       return(any(is.nan(X)))
   }
   # To examine lists, including nested lists
   if(is.list(X)) {
       return(any(rapply(X, is.nan)))
   }
   return(FALSE)
}

# Set up the taskCallback
stopOnNaNs <- function(...) {
    if(nanDetector(.Last.value)) {stop("NaNs detected!\n")}
    return(TRUE)
}
addTaskCallback(stopOnNaNs)


# Try it out
j <- 1:00
y <- rnorm(99)
l <- list(a=1:4, b=list(j=1:4, k=NaN))
# Error in function (...)  : NaNs detected!

# Subsequent time consuming code that could be avoided if the
# error thrown above is used to stop its evaluation.
Josh O'Brien
  • 159,210
  • 26
  • 366
  • 455
  • Hot diggity dog, this has several interesting ideas along the lines I was just starting consider, i.e. using callbacks. Whether or not it will work in my code, I'll have to see, but it's instructional nonetheless. Thanks! – Iterator Feb 06 '12 at 23:45
  • 1
    FWIW it will be really inefficient to perform the checks in R code - it's trivial to do so in C since it's essentially just `if (TYPEOF(x) == REALSXP) { double *d = REAL(x); int n = LENGTH(x); for(int i = 0; i < n; i++) if (!R_finite(d[i])) return ScalarLogical(0); } return ScalarLogical(1);` and you can recurse more quickly (and without copying) into recursive objects. – Simon Urbanek Feb 07 '12 at 00:29
  • @SimonUrbanek Thanks for the C guidance - that should certainly be fast. Is this approach easy to develop to use via, say, the `inline` package? Or can I wish upon a star that it might appear in R? :) (I guess that's an R-devel question. ;-)) – Iterator Feb 07 '12 at 00:50
  • @SimonUrbanek -- That sounds very interesting, but I've got no idea how I'd implement it in practice. I've never programmed with C, but am not afraid to learn how... So, is the basic idea in your comment to put the above code in a text file together with the appropriate R-related headers, compile it with gcc, and then load in the compiled code, calling it with `.C(myCnanDetector)` in place of the calls to the R function `nanDetector()`? – Josh O'Brien Feb 07 '12 at 00:51
  • @JoshO'Brien I'm still testing the function. It generally works, though it isn't always invoked. It seems that if the code is wrapped in `system.time({})` then the callback isn't executed. Not that that ultimately matters (I'll drop `system.time`), but it's worth noting for anyone else who tries this. – Iterator Feb 07 '12 at 01:33
  • Ah, I see the issue now. `addTaskCallback` is for top-level tasks. It seems that this won't stop execution inside of a function. I'll have to look at it a bit more. I've just added example code, which illustrates what happens - the second case isn't terminated. – Iterator Feb 07 '12 at 01:57