
I am trying to prune a decision tree to create 19 trees that have 2-20 terminal nodes, and I would like to calculate the training and test error for each. I used this code:

range <- c(2:20)

for (i in range) {
  prune.fit <- prune.tree(fit, best = i)
  
  plot(prune.fit) # all the plots :) 
  text(prune.fit, pretty = 0)
}

which worked well to generate the trees, but when I added in the training and test error calculations it wouldn't work. I then tried this:

for (i in range) {
    pred.fittrain[i] <- predict(prune.fit[i], newdata = my_ahp_train)
    mean((pred.fittrain - my_ahp_train$sale_price)^2)
    
    pred.fittest[i] <- predict(prune.fit[i], newdata = my_ahp_test)
    mean((pred.fittest - my_ahp_test$sale_price)^2)
}

but it just gave me an error. I don't know how to fix this so that it calculates for each individual tree. If anyone has any tips please let me know!

For the training and test error calculation I tried the following pieces of code:

range <- c(2:20)

for (i in range) {
  prune.fit <- prune.tree(fit, best = i)
  
  plot(prune.fit) # all the plots :) 
  text(prune.fit, pretty = 0)

  pred.fittrain[i] <- predict(prune.fit[i], newdata = my_ahp_train)
  mean((pred.fittrain - my_ahp_train$sale_price)^2)
  
  pred.fittest[i] <- predict(prune.fit[i], newdata = my_ahp_test)
  mean((pred.fittest - my_ahp_test$sale_price)^2)
}

AND

range <- c(2:20)

for (i in range) {
  prune.fit <- prune.tree(fit, best = i)
  
  plot(prune.fit) # all the plots :) 
  text(prune.fit, pretty = 0)

  pred.fittrain <- predict(prune.fit, newdata = my_ahp_train)
  mean((pred.fittrain - my_ahp_train$sale_price)^2)
  
  pred.fittest <- predict(prune.fit, newdata = my_ahp_test)
  mean((pred.fittest - my_ahp_test$sale_price)^2)
}

AND

for (i in range) {
    pred.fittrain[i] <- predict(prune.fit[i], newdata = my_ahp_train)
    mean((pred.fittrain - my_ahp_train$sale_price)^2)
    
    pred.fittest[i] <- predict(prune.fit[i], newdata = my_ahp_test)
    mean((pred.fittest - my_ahp_test$sale_price)^2)
}

I was expecting one of these to generate the training and test errors for each decision tree.

  • Please post error messages as *it wouldn't work* is not helpful. Also, what happens in your last code blocks? Errors? Undesired results? – Parfait Mar 11 '23 at 22:00
  • I may be wrong, but the ``prune.fit[i]`` call might be incorrect (does taking ``[i]`` make sense for a model object?) – runr Mar 11 '23 at 22:44

1 Answer


It's hard to answer without knowing the packages used and without sample data, but the following code might be a step forward; see if it makes sense:

library(magrittr) # for %>%; prune.tree() is assumed to come from the 'tree' package

lapply(2:20, function(i){
  prune.fit <- prune.tree(fit, best = i)
  
  # train
  prediction_train <- predict(prune.fit, newdata = my_ahp_train)
  mse_train <- mean((prediction_train - my_ahp_train$sale_price)^2)
  
  # repeat the same for test 
  prediction_test <- predict(prune.fit, newdata = my_ahp_test)
  mse_test <- mean((prediction_test - my_ahp_test$sale_price)^2)
  c(i = i, mse_train = mse_train, mse_test = mse_test)
  
}) %>% do.call(rbind, .)
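For comparison, the same computation works with a plain `for` loop if the result vectors are preallocated and the pruned model is not subsetted with `[i]` (which is where the original attempts went wrong). This is only a sketch: since `fit`, `my_ahp_train`, and `my_ahp_test` are not available, it simulates a small dataset with the same column name and assumes the `tree` package:

```r
library(tree)  # provides tree() and prune.tree()

# Simulated stand-ins for the question's data (assumption, not the real data)
set.seed(1)
n <- 400
df <- data.frame(x1 = runif(n), x2 = runif(n))
df$sale_price <- 10 * df$x1 + 5 * df$x2^2 + rnorm(n)
train_idx <- sample(n, n / 2)
my_ahp_train <- df[train_idx, ]
my_ahp_test  <- df[-train_idx, ]

fit <- tree(sale_price ~ ., data = my_ahp_train)

# Asking prune.tree() for more leaves than the tree has only triggers a
# warning, so cap the range at the fitted tree's leaf count:
max_leaves <- sum(fit$frame$var == "<leaf>")
sizes <- 2:min(20, max_leaves)

# Preallocate the result vectors -- assigning into pred.fittrain[i] without
# creating the vector first is one reason the original loop errored:
mse_train <- numeric(length(sizes))
mse_test  <- numeric(length(sizes))

for (j in seq_along(sizes)) {
  pruned <- prune.tree(fit, best = sizes[j])  # prune the model, no [i] subsetting
  pred_train <- predict(pruned, newdata = my_ahp_train)
  pred_test  <- predict(pruned, newdata = my_ahp_test)
  mse_train[j] <- mean((pred_train - my_ahp_train$sale_price)^2)
  mse_test[j]  <- mean((pred_test  - my_ahp_test$sale_price)^2)
}

results <- data.frame(size = sizes, mse_train = mse_train, mse_test = mse_test)
results
```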
runr
  • Ok this worked actually. Do you by chance know why using lapply is better than using just for (i in range) to generate a loop? I have been kind of confused about how to know which method to use. Thank you for your time! – madibecoding Mar 15 '23 at 15:22
  • @madibecoding [this answer](https://stackoverflow.com/a/42440872/3629151) might be interesting and useful. In my opinion, the key advantage is the "no side effects" property, since everything within the function body is running in isolation and does not "leak" into the global environment. This implies a smaller chance for bugs (no variables stored by the same name), cleaner operation (all the inner variables are deleted after they are not used anymore, which can also help with bugs), is straightforward to parallelize (if needed), the result collection is automatic, no need to pre-program lists – runr Mar 15 '23 at 20:50
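The "no side effects" point in the comment above can be seen in a toy sketch (variable names here are made up for illustration):

```r
x <- 1:3

# A for loop assigns into the calling environment: after it runs, both the
# loop variable 'i' and the helper 'tmp_for' exist globally.
for (i in x) {
  tmp_for <- i^2
}

# lapply keeps helper variables inside the function body: 'tmp' exists only
# during each call and is discarded afterwards, while the results are
# collected automatically.
squares <- lapply(x, function(i) {
  tmp <- i^2
  tmp
})

exists("tmp_for")  # TRUE  -- leaked from the for loop
exists("tmp")      # FALSE -- stayed inside the function
unlist(squares)    # 1 4 9
```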