
This question may or may not be inspired by my losing an entire 3-hour geocoding run because one of the values returned an error. Cue the pity (down)votes.

Basically there was an error returned inside a function called by sapply. I had options(error=recover) on, but despite browsing through every level available to me, I could not find any place where the results of the (thousands of successful) calls to FUN were stored in memory.

Some of the objects I found while browsing around themselves gave errors when I attempted to examine them, claiming the references were no longer valid. Unfortunately I lost the particular error message.

Here's a quick example which, while it does not replicate the reference error (which I suspect is related to disappearing environments and is probably immaterial), does demonstrate that I cannot see a way to save the data that has already been processed.

Is there such a technique?

Note that I have since realized my error and inserted even more robust error handling than existed before via try, but I am looking for a way to recover the contents ex post rather than ex ante.

Test function

sapply( seq(10), function(x) {
  if(x==5) stop("Error!")
  return( "important data" )
} )

Interactive exploration

> sapply( seq(10), function(x) {
+   if(x==5) stop("Error!")
+   return( "important data" )
+ } )
Error in FUN(1:10[[5L]], ...) : Error!

Enter a frame number, or 0 to exit   

1: sapply(seq(10), function(x) {
    if (x == 5) 
        stop("Error!")
    return("important data")
})
2: lapply(X = X, FUN = FUN, ...)
3: FUN(1:10[[5]], ...)

Selection: 3
Called from: FUN(1:10[[5L]], ...)
Browse[1]> ls()
[1] "x"
Browse[1]> x
[1] 5
Browse[1]> 
Enter a frame number, or 0 to exit   

1: sapply(seq(10), function(x) {
    if (x == 5) 
        stop("Error!")
    return("important data")
})
2: lapply(X = X, FUN = FUN, ...)
3: FUN(1:10[[5]], ...)

Selection: 2
Called from: lapply(X = X, FUN = FUN, ...)
Browse[1]> ls()
[1] "FUN" "X"  
Browse[1]> X
 [1]  1  2  3  4  5  6  7  8  9 10
Browse[1]> FUN
function(x) {
  if(x==5) stop("Error!")
  return( "important data" )
}
Browse[1]> 
Enter a frame number, or 0 to exit   

1: sapply(seq(10), function(x) {
    if (x == 5) 
        stop("Error!")
    return("important data")
})
2: lapply(X = X, FUN = FUN, ...)
3: FUN(1:10[[5]], ...)

Selection: 1
Called from: sapply(seq(10), function(x) {
    if (x == 5) 
        stop("Error!")
    return("important data")
})
Browse[1]> ls()
[1] "FUN"       "simplify"  "USE.NAMES" "X"        
Browse[1]> X
 [1]  1  2  3  4  5  6  7  8  9 10
Browse[1]> USE.NAMES
[1] TRUE
Browse[1]> simplify
[1] TRUE
Browse[1]> FUN
function(x) {
  if(x==5) stop("Error!")
  return( "important data" )
}
Browse[1]> Q

To be clear, what I was hoping to find was the vector:

[1] "important data" "important data" "important data" "important data"

In other words, the results of the internal loop that had been completed to this point.

Edit: Update with C code

Inside .Internal(lapply()) is the following code:

PROTECT(ans = allocVector(VECSXP, n));
...
for(i = 0; i < n; i++) {
   ...
   tmp = eval(R_fcall, rho);
   ...
   SET_VECTOR_ELT(ans, i, tmp);
}

I want to get at ans when any call to lapply fails.
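
For comparison, here is a rough R-level analogue of that loop, offered as a sketch rather than as how base R does it. `lapply_r` is a hypothetical name, not a base function, and it ignores names on `X`. Because `ans` here is an ordinary local variable, selecting the `lapply_r` frame at a `recover` prompt would let you inspect whatever had been filled in before the failure, which the C-level `ans` does not appear to allow.

```r
# Hypothetical R-level analogue of the internal lapply loop (not base R code).
lapply_r <- function(X, FUN, ...) {
  n <- length(X)
  ans <- vector("list", n)            # plays the role of allocVector(VECSXP, n)
  for (i in seq_len(n)) {
    # plays the role of eval() + SET_VECTOR_ELT(); list() keeps NULL results in place
    ans[i] <- list(FUN(X[[i]], ...))
  }
  ans
}

# On success it behaves like lapply() for a plain, unnamed vector input.
str(lapply_r(1:3, function(x) x * 2))
```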

Ari B. Friedman
  • good question! I had a similar issue a while back and couldn't think of anything clever...so I added a line of code to write out an .RDATA file at each step, so I ended up with a directory with 100s of .RDATA files to read back in, but avoided precisely this problem. – Chase Oct 22 '12 at 01:15
  • Yeah. I implemented two fixes: [calls to `try`](https://github.com/gsk3/taRifx.geo/blob/master/R/Contributed.R#L102) and running the `sapply` in batches of 1000 then saving in between. But it doesn't help me now :-/ – Ari B. Friedman Oct 22 '12 at 01:31
  • my guess is that the only thing that can help you now is another 3 hours and a nice [beverage](http://rogue.com/beers/dead-guy-ale.php) – Chase Oct 22 '12 at 01:36
  • Unfortunately it's going to be another 24 hours since there's a 25k geocoding quota for Bing free accounts. – Ari B. Friedman Oct 22 '12 at 01:38
  • As a bit of an aside, don't make a large sapply. If the function in sapply is quite long you're probably doing something inefficiently. Remember that in R, ANY opportunity to do a vectorized operation should be taken. – John Oct 22 '12 at 01:40
  • @John I hear 'ya, but in this case the network latency is >> the loss due to not vectorizing everything. – Ari B. Friedman Oct 22 '12 at 01:45
  • As your question is about fixing this after the fact, these don't quite apply, but future searchers may be interested in similar questions which show how to avoid the problem. See http://stackoverflow.com/q/2589275/210673 and http://stackoverflow.com/q/1395622/210673; also see my solution, which involves a version of `try`, here: http://stackoverflow.com/q/4948361/210673 – Aaron left Stack Overflow Oct 22 '12 at 14:44

3 Answers


I'm struggling to see why a try() here isn't the way to go? If the sapply() fails for whatever reason then you

  1. want to handle that failure well
  2. carry on from there

Why would you want the entire data analysis/processing step to stop just for an error? That is what you seem to be proposing. Rather than trying to recover what has already been done, write your code so that it just carries on, recording that the error took place but also gracefully moving on to the next step in the process.

It is a bit convoluted because the example you give is contrived (if you knew what would cause an error you could handle that without a try()), but bear with me:

foo <- function(x) {
    res <- try({
        if(x==5) {
            stop("Error!")
        } else {
            "important data"
        }
    })
    if(inherits(res, "try-error"))
        res <- "error occurred"
    res
}

> sapply( seq(10), foo)
Error in try({ : Error!
 [1] "important data" "important data" "important data" "important data"
 [5] "error occurred" "important data" "important data" "important data"
 [9] "important data" "important data"

Having run jobs that took weeks to finish in the background on my workstation, I quickly learned to write lots of try() calls around individual statements rather than big blocks of code, so that once an error occurred I could quickly get out of that iteration/step with the least effect on the running job. In other words, if a particular R call failed, I returned something that would slot nicely into the object returned by sapply() (or whatever function).
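
One way to return something that slots into the result is sketched below, using tryCatch() rather than the try() shown above. `with_fallback` is a hypothetical helper name, and NA_character_ is just one possible placeholder; the point is that the fallback has the same type as a successful result, so sapply() can still simplify to a character vector.

```r
# Hypothetical helper: wrap f so that an error yields `otherwise` instead of aborting.
with_fallback <- function(f, otherwise = NA_character_) {
  function(x) tryCatch(f(x), error = function(e) otherwise)
}

# Same contrived failure as in the question.
risky <- function(x) {
  if (x == 5) stop("Error!")
  "important data"
}

res <- sapply(seq(10), with_fallback(risky))

# Indices that failed can be found afterwards and retried.
failed <- which(is.na(res))
failed
```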

For anything more complex, I would probably use lapply():

foo2 <- function(x) {
    res <- try({
        if(x==5) {
            stop("Error!")
        } else {
            lm(rnorm(10) ~ runif(10))
        }
    })
    if(inherits(res, "try-error"))
        res <- "error occurred"
    res
}

out <- lapply(seq(10), foo2)
str(out, max = 1)

because you will want to keep the list rather than try to simplify more complex objects down to something simple:

>     out <- lapply(seq(10), foo2)
Error in try({ : Error!
> str(out, max = 1)
List of 10
 $ :List of 12
  ..- attr(*, "class")= chr "lm"
 $ :List of 12
  ..- attr(*, "class")= chr "lm"
 $ :List of 12
  ..- attr(*, "class")= chr "lm"
 $ :List of 12
  ..- attr(*, "class")= chr "lm"
 $ : chr "error occurred"
 $ :List of 12
  ..- attr(*, "class")= chr "lm"
 $ :List of 12
  ..- attr(*, "class")= chr "lm"
 $ :List of 12
  ..- attr(*, "class")= chr "lm"
 $ :List of 12
  ..- attr(*, "class")= chr "lm"
 $ :List of 12
  ..- attr(*, "class")= chr "lm"

That said, I'd probably have done this via a for() loop, filling in a preallocated list as I iterated.
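
A minimal sketch of that for() loop idea, under the assumption that the slow call can be wrapped in a function (here the hypothetical `slow_call`). Because `results` is an ordinary variable in the calling environment, whatever was filled in before the failure is still there afterwards. The outer try() is only there so that the snippet runs to the end as a script; at the console the loop would simply stop at element 5.

```r
# Hypothetical stand-in for the slow, failure-prone call (e.g. a geocoding request).
slow_call <- function(x) {
  if (x == 5) stop("Error!")
  "important data"
}

inputs  <- seq(10)
results <- vector("list", length(inputs))   # preallocate one slot per input

outcome <- try(
  for (i in seq_along(inputs)) {
    results[[i]] <- slow_call(inputs[[i]])  # each success is stored immediately
  },
  silent = TRUE
)

# Slots that were never reached (or failed) are still NULL.
completed <- !vapply(results, is.null, logical(1))
unlist(results[completed])
```

Restarting from `which(!completed)[1]` after fixing the cause of the error would avoid repeating the calls that already succeeded.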

Gavin Simpson
  • I totally agree, and in fact did [just that](https://github.com/gsk3/taRifx.geo/blob/master/R/Contributed.R#L102) minutes after it failed. Also re-implemented it as a for loop. But it didn't help with the work I'd already computed. I guess what I'm really asking is why it is left to the programmer to have something inside a `lapply` call fail gracefully, rather than `lapply` returning already computed work on an error. The answer may well be overhead, but given that it's already storing `ans` I don't see why the .Int lapply doesn't return what's already been computed. – Ari B. Friedman Oct 22 '12 at 11:56
  • Because `lapply()` was designed to run functions over a set of indices. It is quite general and to keep it that way (and to maintain efficiency) the user is expected to do any trapping of errors. `lapply()` is running R functions from C code. To trap errors there would likely be a pain, better to let R's usual error trapping code kick in - unfortunately once that happens you are outside C and the code never returns to C other than to gracefully back out reporting the error (and by gracefully I mean without anything done so far). – Gavin Simpson Oct 22 '12 at 12:01

You never assigned the intermediate values to anything. I don't understand why you think there should be any entrails to divine. You need to record the values somehow:

res <- sapply( seq(10), function(x) {
  on.exit(res <<- x)  # runs when FUN exits, even via an error
  if(x==5) stop("Error!")
} )
Error in FUN(1:10[[5L]], ...) : Error!
res
#[1] 5

This on.exit method is illustrated on the ?par page as a way of restoring par settings when plotting has gone wrong. (I was not able to get it to work with on.exit(res <- x).)
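
A likely reason the `<-` form appears not to work, sketched under the usual R scoping rules: inside on.exit(), `<-` creates a variable in the function's own frame, which is discarded when the function exits, whereas `<<-` assigns to a variable found in an enclosing environment. The names `last_local` and `last_super` are hypothetical, and the try() is only there so that the snippet runs to the end as a script.

```r
last_local <- NULL
last_super <- NULL

f <- function(x) {
  on.exit({
    last_local <- x    # local to this call's frame; lost when the call exits
    last_super <<- x   # updates the variable in the enclosing environment
  })
  if (x == 5) stop("Error!")
  "important data"
}

attempt <- try(sapply(seq(10), f), silent = TRUE)

is.null(last_local)  # TRUE: the outer variable was never touched
last_super           # the input that was being processed when the error hit
```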

IRTFM
  • I don't want the last value of `x` though. I want every value that the function generated before the value that made it fail. `sapply` has to be storing it somewhere. In other words, I want `c("important data","important data","important data","important data")`, not `c(5)`. – Ari B. Friedman Oct 22 '12 at 01:34
  • I don't think `sapply` is storing it. You need to store it .... er, them. @TylerRinker's code illustrates one method of doing so. – IRTFM Oct 22 '12 at 02:28
  • If `sapply` (really, `lapply` which is called inside `sapply`) is not storing it, then how does it magically appear in cases where there's no error? It may not be stored in the R code, but somewhere in the `.Internal` version of `lapply` there's a vector stored with my results, goshdarnit :-) – Ari B. Friedman Oct 22 '12 at 03:21
  • @AriB.Friedman -- I think here and with your edit containing the C code, you've answered your own question, which is that you can't get at those intermediate results "from R" since they're being stored in a C variable named `ans`. I suppose if you had compiled your own R with C-level debugging enabled you might be able to recover on the error and have a look at `ans`, but I know close to nothing about how that'd work. – Josh O'Brien Oct 22 '12 at 11:22

Maybe I'm not understanding, and this will certainly slow you down, but what about a global assignment each time?

safety <- vector()
sapply( seq(10), function(x) {
  if(x==5) stop("Error!")
  assign('safety', c(safety, x), envir = .GlobalEnv)
  return( "important data" )
} )

Yields:

> safety <- vector()
> sapply( seq(10), function(x) {
+   if(x==5) stop("Error!")
+   assign('safety', c(safety, x), envir = .GlobalEnv)
+   return( "important data" )
+ } )
Error in FUN(1:10[[5L]], ...) : Error!
> safety
[1] 1 2 3 4
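
A variation on the same idea, sketched under assumptions: instead of growing a global vector of inputs with c(), each computed result is stored under its index in a dedicated environment (the hypothetical `store`), so the results themselves survive the error. The try() is only there so that the snippet runs to the end as a script.

```r
store <- new.env()

attempt <- try(
  sapply(seq(10), function(x) {
    if (x == 5) stop("Error!")
    value <- "important data"
    assign(as.character(x), value, envir = store)  # record the result before returning
    value
  }),
  silent = TRUE
)

# Indices that completed, in numeric order, and their stored results.
done      <- sort(as.integer(ls(store)))
recovered <- unlist(mget(as.character(done), envir = store))
recovered
```
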
Tyler Rinker