R random forest - training set using target column for prediction

Question

I am learning how to use various random forest packages and coded up the following from example code:

library(party)
library(randomForest)

set.seed(415)

#I'll try to reproduce this with a public data set; in the meantime here's the existing code
data = read.csv(data_location, sep = ',')
test = data[1:65]  #basically data w/o the "answers"

m = sample(1:(nrow(data)),nrow(data)/2,replace=FALSE)
o = sample(1:(nrow(data)),nrow(data)/2,replace=FALSE)

train2 = data[m,]
train3 = data[o,]

#random forest implementation
fit.rf <- randomForest(train2[,66] ~., data=train2, importance=TRUE, ntree=10000)
Prediction.rf <- predict(fit.rf, test) #to see if the predictions are accurate -- but it errors out unless I give it all data[1:66]

#cforest implementation
fit.cf <- cforest(train3[,66]~., data=train3, controls=cforest_unbiased(ntree=10000, mtry=10))
Prediction.cf <- predict(fit.cf, test, OOB=TRUE) #to see if the predictions are accurate -- but it errors out unless I give it all data[1:66]

data[,66] is the target factor I'm trying to predict, but it seems that using "~ ." to solve for it causes the formula to use the factor in the prediction model itself.

How do I solve for the dimension I want on high-ish dimensionality data, without having to spell out exactly which dimensions to use in the formula (so I don't end up with some sort of cforest(data[,66] ~ data[,1] + data[,2] + data[,3] ... etc.)?

EDIT: On a high level, I believe one basically

loads full data
breaks it down to several subsets to prevent overfitting
trains via subset data
generates a fitting formula so one can predict values of target (in my case data[,66]) given data[1:65].

So my PROBLEM now is that if I give it a new set of test data, let's say test = data[1:65], it says "Error in eval(expr, envir, enclos) :" because it is expecting data[,66]. I want to basically predict data[,66] given the rest of the data!

I don't see a documented function named `cforest` in `library(randomForest)`. Is this the right package? — MrFlick, Jun 13 '14 at 13:48
Does `data` have column names? And what is `train3`? Does `train3` only have covariates? From your example it seems `data` has all the variables so maybe that should be in the `data=` parameter. This is why it's always best to provide a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). — MrFlick, Jun 13 '14 at 13:50
Data has column names -- but does using them help? Since it's a high (well, 60+) dimensionality vector, I didn't spell out the columns using c(col) while importing, but I did do some amount of preprocessing to make sure all dimensions are representable in numeric format. train3 = is the training set, a randomized pick of 50% subset of data. (Thanks for the edit.) — binarysolo, Jun 13 '14 at 13:57

jld · Accepted Answer · 2014-06-13T14:18:27.083

1

I think that if the response is in train3 then it will be used as a feature.

I believe this is more like what you want:

library(party)

ctrl <- cforest_unbiased(ntree=1000, mtry=3)

mod <- cforest(iris[,5] ~ ., data = iris[,-5], controls=ctrl)
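
A related variant, offered only as a sketch: if the response is referred to by its column name rather than by `data[,k]`, the `.` in the formula expands to every other column of the data frame, and `predict` can then be given new data that lacks the response column. The 100/50 split of `iris`, the small `ntree`, and the object names below are assumptions chosen for illustration, not part of the original question.

```r
library(party)

set.seed(415)

# Hold out 50 rows; the split size is an arbitrary choice for this sketch.
idx   <- sample(nrow(iris), 100)
train <- iris[idx, ]        # predictors plus the Species response
test  <- iris[-idx, -5]     # predictors only, no Species column

# Small forest to keep the example quick (assumed settings).
ctrl <- cforest_unbiased(ntree = 50, mtry = 2)

# Naming the response lets "." stand for all remaining columns of train.
mod <- cforest(Species ~ ., data = train, controls = ctrl)

# The new data only has to contain the predictor columns.
pred <- predict(mod, newdata = test)
```

With a 66-column data frame, the same pattern would use the name of column 66 on the left-hand side of the formula.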

edited Jun 13 '14 at 14:18

answered Jun 13 '14 at 14:05


Whoops, I changed it up and elaborated it a bit more. – binarysolo Jun 13 '14 at 14:22
My understanding is that if you pass the `cforest` or `randomForest` a data set with 66 columns then it will fit the model on that and will require new data to have 66 columns. It looks like you are including the response as a feature when you fit the model which is why it expects 66 columns and doesn't work when you try to use `predict` with a data frame of only 65 columns. – jld Jun 13 '14 at 14:29
Doh - so how should I go about training something with solutions but not have the solutions be included for my set of predictions? – binarysolo Jun 13 '14 at 14:33
With `randomForest` you can make the feature matrix `x` and the response `y` and then fit the model via `randomForest(x = x, y = y, ...)`; `cforest` doesn't seem to have this option so it may be necessary to instead do as I did in my original answer: `cforest(dat[,66] ~ ., data = dat[,-66])`. – jld Jun 13 '14 at 14:38
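
The `x`/`y` interface mentioned in the last comment can be sketched as follows on `iris`. This is a minimal illustration under assumed settings (a 50/50 split, `ntree = 500`, and made-up object names), not the asker's actual data. The second fit shows one way to build the formula from a column name so that the predictors do not have to be spelled out.

```r
library(randomForest)

set.seed(415)

# Random half of the rows for training (assumed split for illustration).
idx <- sample(nrow(iris), nrow(iris) / 2)

x_train <- iris[idx, -5]    # feature columns only
y_train <- iris[idx, 5]     # response is a factor, so this is classification
x_test  <- iris[-idx, -5]   # new data without the response column

# Matrix-style interface: the response never enters the feature set.
fit  <- randomForest(x = x_train, y = y_train, ntree = 500, importance = TRUE)
pred <- predict(fit, x_test)

# Formula interface: build "Species ~ ." from the column name, so "."
# expands to every column except the response.
resp  <- names(iris)[5]
fit2  <- randomForest(as.formula(paste(resp, "~ .")), data = iris[idx, ], ntree = 500)
pred2 <- predict(fit2, x_test)
```

In both cases the fitted model was trained on four predictors, so `predict` accepts a data frame containing only those four columns.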
