
As with other machine learning methods, I split my original data set into a training set and a test set (70% training, 30% test).

Here is my code.

install.packages(randomForestSRC)
library(randomForestSRC)
data(pbc, package="randomForestSRC")
data <- na.omit(pbc)

train <- sample(1:nrow(data), round(nrow(data) * 0.70))

data.grow <- rfsrc(Surv(days, status) ~ ., 
                   data[train, ], 
                   ntree = 100,
                   tree.err = TRUE,
                   importance = TRUE,
                   nsplit = 1,
                   proximity = TRUE)

data.pred <- predict(data.grow, 
                     data[-train, ],
                     importance = TRUE,
                     tree.err = TRUE)

My question is about the predict function in this code.

Originally, I wanted to build a prediction model based on a random survival forest to predict disease development.

For example, after building the prediction model with the training data set, I wanted to estimate the probability of disease development for test data that contains no information about disease incidence for each individual, because I would like to predict that probability from a subject's general characteristics such as age, BMI, and sex.

However, contrary to that intention, the "predict" function in this package did not work on data that has no status information (event/censored).

The "predict" function seems to work only with outcome information (event/censored).

Therefore, I cannot understand what the "predict" function means.

If the "predict" function works only with outcome information, how can I predict future disease development based on a subject's general characteristics?

In addition, if the prediction in this model is built from the outcome information, what does "predict" mean in a random survival forest model?

Please let me know what the "predict" function in this package means.

Thank you for reading my long question.

SJUNLEE

1 Answer


The predict method for this type of model, i.e. predict.rfsrc, works much like you'd expect if you've used predict with glm, lm, RRF or other models.

The predict call does not require you to know the outcome for the prediction data set, so I am trying to understand why you thought that it did. Your example rfsrc statement does not work because it refers to columns that are not in the example data set.

The clearest way to show this is with a reproducible example. If you have further questions, you can ask me in a comment.

# Load the package and train an RFSRC model
library(randomForestSRC)
mtcars.mreg <- rfsrc(Surv(mpg, cyl) ~ ., data = mtcars[1:30, ],
                     tree.err = TRUE, importance = TRUE)

# Simulate new data
new_data <- mtcars[31:32, ]

# Predict
predicted <- predict(mtcars.mreg, new_data)
predicted
  Sample size of test (predict) data: 2
                Number of grow trees: 1000
  Average no. of grow terminal nodes: 4.898
         Total no. of grow variables: 9
                            Analysis: RSF
                              Family: surv-CR
                 Test set error rate: NA
predicted$predicted
       event.1  event.2  event.3
[1,] 0.4781338 2.399299 14.71493
[2,] 3.2185606 4.720809  2.15895
Hack-R
  • Thank you for your reply. There is an error in the first line of my code. If you change install.packages(randomForestSRC) to install.packages("randomForestSRC"), it should work. Moreover, when I run the predict function on data without the y-variables (days, status), I get the results I wanted. So, in my example, the RSF model can predict "survival probability" and "time.interest". – SJUNLEE Jul 06 '18 at 05:53
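
What the comment describes can be sketched roughly as follows, using the pbc data from the question. This is a minimal sketch, not a definitive recipe: the names pbc.complete, new.subjects, horizon and event.prob are made up for illustration, the 1000-day horizon is an arbitrary assumption, and reading "probability of disease development by a given time" as one minus the predicted survival at that time is an interpretation rather than something stated in the thread.

```r
library(randomForestSRC)

data(pbc, package = "randomForestSRC")
pbc.complete <- na.omit(pbc)

set.seed(1)  # arbitrary seed so the 70/30 split is reproducible
train <- sample(seq_len(nrow(pbc.complete)), round(nrow(pbc.complete) * 0.70))

fit <- rfsrc(Surv(days, status) ~ ., data = pbc.complete[train, ], ntree = 100)

# Drop the outcome columns to mimic new subjects whose outcome is unknown
new.subjects <- pbc.complete[-train,
                             setdiff(names(pbc.complete), c("days", "status"))]

pred <- predict(fit, newdata = new.subjects)

# pred$time.interest: event times from the training data
# pred$survival: one row per new subject, one column per time in time.interest
horizon <- 1000  # hypothetical horizon, in days
idx <- findInterval(horizon, pred$time.interest)

# If the horizon precedes the first event time, predicted survival is 1
surv.at.horizon <- if (idx > 0) pred$survival[, idx] else rep(1, nrow(new.subjects))
event.prob <- 1 - surv.at.horizon

head(event.prob)
```

Because new.subjects has no days or status columns, the forest cannot compute a test-set error rate, but the per-subject survival curves are still returned.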