0

I am running this function to do n-fold cross-validation. The misclassification rate does not vary over folds, e.g. if I run 10 or 50. I am also getting a warning:

"Warning message:

'newdata' had 19 rows but variables found have 189 rows"

If I run the code without being part of a function, it is doing want I want -> e.g. for folds==1, it is pulling out 10%, running the model on 90% of the data, and predicting the other 10%. Does anyone have any ideas as to why it is not showing variation by variable and the number of folds?

library("MASS")  
data(birthwt)
data=birthwt

n.folds=10

jim = function(x,y,n.folds,data){

  for(i in 1:n.folds){
    folds <- cut(seq(1,nrow(data)),breaks=n.folds,labels=FALSE)      
    testIndexes <- which(folds==i,arr.ind=TRUE)
    testData <- data[testIndexes, ]
    trainData <- data[-testIndexes, ]
    glm.train <- glm(y ~ x, family = binomial, data=trainData)
    predictions=predict(glm.train, newdata =testData, type='response')
    pred.class=ifelse(predictions< 0, 0, 1)
    }

  rate=sum(pred.class!= y) / length(y)
  print(head(rate))
  }

jim(birthwt$smoke, birthwt$low, 10, birthwt)
Community
  • 1
  • 1
  • Thanks for that - the predictions should be (<0.5, 0,1). The function is still not right, but thanks for your observation. – user3504135 Nov 04 '16 at 14:49
  • I want pred.class as a vector that has all of the predictions from each fold. In this function, I'm just getting 19 back, when it should be 189. Then I produce rate using this vector of length 189. – user3504135 Nov 04 '16 at 14:55

1 Answers1

0

I am now making my comments into an answer.

jim <- function(x, y, n.folds, data) {   

  pred.class <- numeric(0)  ## initially empty; accumulated later
  for(i in 1:n.folds){
    folds <- cut(seq(1,nrow(data)), breaks = n.folds, labels = FALSE)  
    testIndexes <- which(folds == i)  ## no need for `arr.ind = TRUE`
    testData <- data[testIndexes, ]
    trainData <- data[-testIndexes, ]
    ## `reformulate` constructs formula from strings. Read `?reformulate`
    glm.train <- glm(reformulate(x, y), family = binomial, data = trainData)
    predictions <- predict(glm.train, newdata = testData, type = 'response')
    ## accumulate the result using `c()`
    ## change `predictions < 0` to `predictions < 0.5` as `type = response`
    pred.class <- c(pred.class, ifelse(predictions < 0.5, 0, 1))
    }

  ## to access a column with string, use `[[]]` not `$`
  rate <- sum(pred.class!= data[[y]]) / length(data[[y]])
  rate  ## or `return(rate)`
  }

jim("smoke", "low", 10, birthwt)
# [1] 0.3121693

Remark:

  1. No need to put arr.ind = TRUE here, although it has no side-effect.
  2. There is something wrong with your classification. You set type = "response", then you use ifelse(predictions < 0, 0, 1). Think about it, you always get 1 for pred.class.
  3. Each iteration of your for loop overwrites the pred.class. I think you want to accumulate the result. So do pred.class <- c(pred.class, ifelse(predictions < 0.5, 0, 1));
  4. Wrong use of glm and predict. It is wrong to put $ in model formula. Please read Predict() - Maybe I'm not understanding it. Here, I have changed your function to accept variable names (as a string), and use proper model formula inside glm. Note, this change requires to place y with data[[y]] in rate = sum(pred.class!= y) / length(y).
  5. You probably want to return rate rather than just printing it to screen. So replace your print line by explicit return(rate), or implicit rate.
  6. You can replace ifelse(predictions < 0.5, 0, 1) with as.integer(predictions < 0.5), although I did not change it in above.
Community
  • 1
  • 1
Zheyuan Li
  • 71,365
  • 17
  • 180
  • 248
  • Thank you. The rate should be % of predictions that are not the same as y. Predictions should be a stack of the predictions from each prediction in the loop. I see now that each iteration overwrites pred.class. How can I return predictions and then compute rate? – user3504135 Nov 04 '16 at 15:16
  • Thanks for that. However, if you input other variables into jim("smoke", "low", 10, birthwt), e.g. "age", "low", or "race" "low", you still get 31%. Also, if you change n.folds to say 50, you still get 31%. That is where the issue is. Something is wrong. – user3504135 Nov 04 '16 at 15:31
  • I see now. Funny that for folds>10, the rate does not change. Thank you very much for that. I do greatly appreciate your time in helping me with this! Maybe some day I'll be able to contribute! – user3504135 Nov 04 '16 at 15:46