6

5 days and still no answer

  • As can be seen by Simon's comment, this is a reproducible and very strange issue. It seems that the issue only arises when a stepwise regression with very high predictive power is wrapped in a function.

I have been struggling with this for a while and any help would be much appreciated. I am trying to write a function that runs several stepwise regressions and outputs all of them to a list. However, R is having trouble reading the dataset that I specify in my function arguments. I found several similar errors on various boards (here, here, and here), however none of them seemed to ever get resolved. It all comes down to some weird issues with calling step() in a user-defined function. I am using the following script to test my code. Run the whole thing several times until an error arises (trust me, it will):

test.df <- data.frame(a = sample(0:1, 100, rep = T),
                      b = as.factor(sample(0:5, 100, rep = T)),
                      c = runif(100, 0, 100),
                      d = rnorm(100, 50, 50))
test.df$b[10:100] <- test.df$a[10:100] #making sure that at least one of the variables has some predictive power

stepModel <- function(modeling.formula, dataset, outfile = NULL) {
  if (is.null(outfile) == FALSE){
    sink(file = outfile,
         append = TRUE, type = "output")
    print("")
    print("Models run at:")
    print(Sys.time())
  }
  model.initial <- glm(modeling.formula,
                       family = binomial,
                       data = dataset)
  model.stepwise1 <- step(model.initial, direction = "backward")
  model.stepwise2 <- step(model.stepwise1, scope = ~.^2)
  output <- list(modInitial = model.initial, modStep1 = model.stepwise1, modStep2 = model.stepwise2)
  sink()
  return(output)
}

blah <- stepModel(a~., dataset = test.df)

This returns the following error message (if the error does not show up right away, keep re-running the test.df script as well as the call for stepModel(), it will show up eventually):

Error in is.data.frame(data) : object 'dataset' not found

I have determined that everything runs fine up until model.stepwise2 starts to get built. Somehow, the temporary object 'dataset' works just fine for the first stepwise regression, but fails to be recognized by the second. I found this by commenting out part of the function as can be seen below. This code will run fine, proving that the object 'dataset' was originally being recognized:

stepModel1 <- function(modeling.formula, dataset, outfile = NULL) {
  if (is.null(outfile) == FALSE){
    sink(file = outfile,
         append = TRUE, type = "output")
    print("")
    print("Models run at:")
    print(Sys.time())
  }
  model.initial <- glm(modeling.formula,
                       family = binomial,
                       data = dataset)
  model.stepwise1 <- step(model.initial, direction = "backward")
#   model.stepwise2 <- step(model.stepwise1, scope = ~.^2)
#   sink()
#   output <- list(modInitial = model.initial, modStep1 = model.stepwise1, modStep2 = model.stepwise2)
  return(model.stepwise1)
}

blah1 <- stepModel1(a~., dataset = test.df) 

EDIT - before anyone asks, all the summary() functions were there because the full function (i edited it so that you could focus in on the error) has another piece that defines a file to which you can output stepwise trace. I just got rid of them

EDIT 2 - session info

sessionInfo() R version 2.15.1 (2012-06-22) Platform: x86_64-pc-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] tcltk     stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] sqldf_0.4-6.4         RSQLite.extfuns_0.0.1 RSQLite_0.11.3        chron_2.3-43         
 [5] gsubfn_0.6-5          proto_0.3-10          DBI_0.2-6             ggplot2_0.9.3.1      
 [9] caret_5.15-61         reshape2_1.2.2        lattice_0.20-6        foreach_1.4.0        
[13] cluster_1.14.2        plyr_1.8             

loaded via a namespace (and not attached):
 [1] codetools_0.2-8    colorspace_1.2-1   dichromat_2.0-0    digest_0.6.2       grid_2.15.1       
 [6] gtable_0.1.2       iterators_1.0.6    labeling_0.1       MASS_7.3-18        munsell_0.4       
[11] RColorBrewer_1.0-5 scales_0.2.3       stringr_0.6.2      tools_2.15

EDIT 3 - this performs all the same operations as the function, just without using a function. This will run fine every time, even when the algorithm doesn't converge:

modeling.formula <- a~.
dataset <- test.df
outfile <- NULL
if (is.null(outfile) == FALSE){
  sink(file = outfile,
       append = TRUE, type = "output")
  print("")
  print("Models run at:")
  print(Sys.time())
}
  model.initial <- glm(modeling.formula,
                       family = binomial,
                       data = dataset)
  model.stepwise1 <- step(model.initial, direction = "backward")
  model.stepwise2 <- step(model.stepwise1, scope = ~.^2)
  output <- list(modInitial = model.initial, modStep1 = model.stepwise1, modStep2 = model.stepwise2)
Community
  • 1
  • 1
zap2008
  • 684
  • 1
  • 9
  • 24
  • I tried to reproduce this error and couldn't do so. I'm running R 2.15.3 64-bit on Windows 7 and the function ran satisfactorily until it reached the call to `sink()`, at which point it reported `In sink() : no sink to remove`. Could you say what version of R you are using? – Simon May 17 '13 at 04:19
  • @Simon - i just put the outfile argument back into my function. this made the error reappear for me. I am adding system and session info to the question now. – zap2008 May 17 '13 at 04:32
  • Now this is extremely strange... The error is popping up seemingly at random. If I run my code 10 times it pops up maybe twice... – zap2008 May 17 '13 at 04:37
  • When I tried `blah <- stepModel(a~., data = test.df,"test.txt")`, I got no error at all and `blah` contained what I believe to be the expected output. – Simon May 17 '13 at 04:39
  • I ran `blah <- stepModel(a~., data = test.df,"test.txt")` some 20 times and got no error at all in any run. I'd suggest reinstalling R. – Simon May 17 '13 at 04:41
  • @Simon - i am reinstalling R now. just to humor me, try re-running the whole thing several times, including the creation of the random dataset `test.df` – zap2008 May 17 '13 at 04:42
  • 1
    Fascinating. When I tried re-running the whole thing several times, including creating the random dataset `test.df`, I got the `Error in is.data.frame(data) : object 'dataset' not found` message after the third, sixth, seventh and tenth time that I ran it. Also, most of those times, I also got warning messages to say that the glm.fit algorithm did not converge. My suspicion is that the content of the frame seems to be part of the problem. – Simon May 17 '13 at 05:04
  • You'd think so, but when you run it outside of the function, there is never an error. Even when the glm.fit algorithm doesn't converge. I am updating my question to include a block of code to run this without the function. – zap2008 May 17 '13 at 05:12
  • A point of order; the question you refer to on StackOverflow is solved, and in much the same way as here... – Aaron left Stack Overflow May 23 '13 at 11:49
  • Yes it was marked correct, but judging by the user's comments below your answer, it seemed like he was still having some issues. I could have just misread him however. I also tried using that method to no avail for the reasons I mentioned below. – zap2008 May 23 '13 at 14:06

1 Answers1

6

Using do.call to refer to the data set in the calling environment works for me. See https://stackoverflow.com/a/7668846/210673 for the original suggestion. Here's a version that works (with sink code removed).

stepModel2 <- function(modeling.formula, dataset) {
  model.initial <- do.call("glm", list(modeling.formula,
                       family = "binomial",
                       data = as.name(dataset)))
  model.stepwise1 <- step(model.initial, direction = "backward")
  model.stepwise2 <- step(model.stepwise1, scope = ~.^2)
  list(modInitial = model.initial, modStep1 = model.stepwise1, modStep2 = model.stepwise2)
}

blah <- stepModel2(a~., dataset = "test.df")

It fails for me consistently with set.seed(6) with the original code. The reason it fails is that the dataset variable is not present within the step function, and although it's not needed in making model.stepwise1, it is needed for model.stepwise2 when model.stepwise1 keeps a linear term. So that's the case when your version fails. Calling the dataset from the global environment as I do here fixes this issue.

Community
  • 1
  • 1
Aaron left Stack Overflow
  • 36,704
  • 7
  • 77
  • 142
  • so are you suggesting that I define an object called `dataset` in the global environment? That was my workaround originally, but it makes the function less usable. – zap2008 May 23 '13 at 01:20
  • No, this function refers to "test.df" in the original environment rather than "dataset" within the function. My text about it is perhaps unclear; I'll edit. You should be able to just load this function and run the code example at the bottom. – Aaron left Stack Overflow May 23 '13 at 01:31
  • This is great, thank you so much! I tried using `do.call` before I even posted this, as was suggested in the linked question, but I tried using `get()` instead of `as.name()`. How do the two differ? – zap2008 May 23 '13 at 14:05
  • `get("x")` retrieves the object with name `x`, `as.name("x")` merely tells R that `x` is not a string but the name of an object. – Aaron left Stack Overflow May 23 '13 at 19:08
  • Hmmm... I understand the difference, but I'm not sure I completely understand why those two would behave all that differently in a function. I'll look into it a little more on my own. Thanks! – zap2008 May 24 '13 at 04:07
  • If you want to pass the variable `dataset` to the function rather than the name `"dataset"`, you can use `data.name=deparse(substitute(dataset))` inside the function and pass `data.name` to the `do.call` – Duccio A Dec 13 '18 at 10:05