I was able to successfully run a random forest (RF) model using some R code I was given. The code is below, along with a snippet of my data.

The only problem is that, as written, the code outputs just a vector of probabilities and none of the data from the original test data set, "testset". I couldn't find a solution online, so I am trying to figure out how to output the probabilities along with the original data frame. In other words, I want the probabilities to be another column in the data set, right after my FLSAStat column, so that I can then write all of it together to a CSV file.

Here's what I have:

#####################################################
# 1. SETUP DATA
#####################################################
mydata <- read.csv("train_test.csv", header=TRUE)
colnames(testset)
[1] "train"           "Target"          "ApptCode"        "Directorate"         "New_Discipline"  "Series"          "Adjusted.Age"   
[8] "Adj.Service"     "Adj.Age.Service" "HiEducLv"        "Gender"           "RetCd"           "FLSAStat"  
> head(testset)
 train Target ApptCode                Directorate             New_Discipline Series  Adjusted.Age Adj.Service Adj.Age.Service HiEducLv Gender
5909     0     NA       IN                   Business Math  Computer Science  IT     PSTS        54.44          10           64.44 Bachelor   Male
5910     0     NA       IN                Computation Math  Computer Science  IT   PSTS        51.51          15           66.51 Bachelor   Male
5911     0     NA       IN Physical and Life Sciences                    Physics   PSTS        40.45           5           45.45      PHD   Male
5912     0     NA       IN  Weapons and Complex Integ                    Physics   PSTS        62.21          35           97.21      PHD   Male
5913     0     NA       IN  Weapons and Complex Integ                    Physics   PSTS        45.65          15           60.65      PHD   Male
5914     0     NA       FX Physical and Life Sciences                    Physics   PSTS        36.13           5           41.12      PHD   Male
  RetCd FLSAStat
5909  TCP2        E
5910  TCP2        E
5911  TCP2        E
5912  TCP2        E
5913  TCP1        E
5914  TCP2        E    

#create train and test sets
trainset = mydata[mydata$train == 1,]
testset = mydata[mydata$train == 0,]
#eliminate unwanted columns from train set 
trainset$train = NULL
#####################################################
# 2. set the formula
#####################################################
theTarget <- "Target"
theFormula <- as.formula(paste("as.factor(",theTarget, ") ~ . "))
theFormula1 <- as.formula(paste(theTarget," ~ . "))
trainTarget = trainset[,which(names(trainset)==theTarget)]
testTarget  = testset[,which(names(testset)==theTarget)]

#####################################################
# Random Forest
#####################################################
library(randomForest)
what <- "Random Forest"
FOREST_model <- randomForest(theFormula, data=trainset, ntree=500)
train_pred <- predict(FOREST_model, trainset, type="prob")[,2]
test_pred <- predict(FOREST_model, testset, type="prob")[,2]
display_results()
testID  <- testset$case_id
predictions <- test_pred
submit_file = cbind(testID,predictions)
write.csv(submit_file, file="RANDOM4.csv", row.names = FALSE)

I think the problem is that I am missing a line of code that merges the predictions vector back into testset. I'm guessing this would go somewhere before the third-to-last line of code.

daniellopez46
    Hi! Do you mind reviewing [this question](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) and revising your question? It will be much easier for us to help you if you can provide a reproducible example of your starting dataset, the probability output from your model, and how those two things should be joined back together. In lieu of, or in addition to, that, take a look at `cbind()`, `rbind()`, `merge()`, or `match()` to do what you need. The first two simply combine objects by columns or rows, while the last two are roughly equivalent to SQL joins. – Chase Jun 11 '12 at 23:52

1 Answer

Just add the column to your dataframe like so:

testset$Predictions <- test_pred
write.csv(testset, file="RANDOM4.csv", row.names = FALSE)
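
A side note on why the original code produced only the probabilities: going by the colnames() output in the question, testset has no case_id column, so testset$case_id is NULL, and cbind() of NULL with a vector returns just that vector as a one-column matrix. The sketch below shows the column assignment on a toy data frame and, in case FLSAStat is not the last column, one way to move the new column directly after it. The toy values, the extra column order, and the temporary output file are assumptions made for illustration; test_pred here is a made-up stand-in for the real predict() output and is assumed to have one entry per row of testset, in the same order.

```r
# Toy stand-in for testset: column names mirror the question, values are made up.
testset <- data.frame(
  Target   = c(NA, NA, NA),
  Gender   = c("Male", "Male", "Female"),
  FLSAStat = c("E", "E", "N"),
  RetCd    = c("TCP2", "TCP1", "TCP2"),
  stringsAsFactors = FALSE
)

# Hypothetical probabilities standing in for
# predict(FOREST_model, testset, type = "prob")[, 2]
test_pred <- c(0.12, 0.80, 0.45)

# Cheap sanity check: one prediction per row of the test set.
stopifnot(length(test_pred) == nrow(testset))

# Add the predictions as a new column (it is appended at the end).
testset$Predictions <- test_pred

# Reorder so that Predictions sits directly after FLSAStat.
others    <- setdiff(names(testset), "Predictions")
pos       <- match("FLSAStat", others)
new_order <- append(others, "Predictions", after = pos)
testset   <- testset[, new_order]

# Write the combined data frame; a temporary file is used here for the demo.
out_file <- tempfile(fileext = ".csv")
write.csv(testset, file = out_file, row.names = FALSE)
```

If FLSAStat is already the last column, as the colnames() output suggests, the reordering step changes nothing and the plain assignment is enough.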
David Robinson