I am doing this:

RMSE <- (sum((RFestimated-model1$y)^2)/length(model1$y))^(1/2)

where model1 is a regression model from a random forest, model1$y is the value being predicted from the training data, and RFestimated is the predicted value from the test data.

I am trying to calculate the RMSE, but the two vectors have different lengths. Is there a trick to making the lengths equal?

These are my steps:

# sample 80% of the data for training -random sample
train_index <- sample(1:nrow(beijingData), 0.8 * nrow(beijingData))
# take the difference as data to test the model
test_index <- setdiff(1:nrow(beijingData), train_index)

#create Train and Test data sets based on the indexes above.
dataTrain <-  beijingData[train_index,]
dataTest <- beijingData[test_index,]

#check the datasets dimensions
dim(dataTrain)
dim(dataTest)

> dim(dataTrain)
[1] 33405    13
> dim(dataTest)
[1] 8352   13

#set seed
set.seed(100)
#create a random forest regression model
model1 <- randomForest(pm2.5 ~ ., data = dataTrain, ntree = 500, importance = TRUE)
model1

#predict with test data
RFestimated <- predict(model1, dataTest)

[1] 118.7794
> length(RFestimated)
[1] 8352
> length(model1$y)
[1] 33405

qqnorm((RFestimated - model1$y)/sd(RFestimated-model1$y))

qqline((RFestimated-model1$y)/sd(RFestimated-model1$y))

#results of the last two statements above
> qqnorm((RFestimated - model1$y)/sd(RFestimated-model1$y))
Warning messages:
1: In RFestimated - model1$y :
  longer object length is not a multiple of shorter object length
2: In RFestimated - model1$y :
  longer object length is not a multiple of shorter object length
> 
> qqline((RFestimated-model1$y)/sd(RFestimated-model1$y))
Warning messages:
1: In RFestimated - model1$y :
  longer object length is not a multiple of shorter object length
2: In RFestimated - model1$y :
  longer object length is not a multiple of shorter object length
duckmayr
Jawahar
  • Please see [How to make a great R reproducible example?](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). You will likely get an answer to your question more quickly if you have a reproducible example. Provide the smallest input data needed to reproduce what you are asking and the desired output. – steveb Aug 03 '18 at 20:21
  • I have a hunch that `model1$y` is the original `y` and has a different length than your predicted values. – coffeinjunky Aug 03 '18 at 23:21

1 Answer

Have a look at these lines here:

#predict with test data
RFestimated <- predict(model1, dataTest)

[1] 118.7794
> length(RFestimated)
[1] 8352
> length(model1$y)
[1] 33405

Their lengths differ, so the element-wise subtraction has nothing sensible to pair up. R recycles the shorter vector and warns, just as in this small example:

a <- c(1,2,3)
b <- c(4,5)
a-b
[1] -3 -3 -1
Warning message:
In a - b : longer object length is not a multiple of shorter object length
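
The recycling behind that warning can be made explicit. The following is an illustrative sketch only: it shows that R reuses the shorter vector from its start, and that a simple length comparison (the variable name same_length is made up for this example) detects the mismatch before any arithmetic is trusted.

```r
a <- c(1, 2, 3)
b <- c(4, 5)

# R recycles b to c(4, 5, 4) before subtracting, and warns because
# length(a) is not a multiple of length(b).
recycled <- suppressWarnings(a - b)
explicit <- a - c(4, 5, 4)
identical(recycled, explicit)  # TRUE

# A simple guard: compare the lengths before relying on the result.
same_length <- length(a) == length(b)
if (!same_length) {
  message("lengths differ: ", length(a), " vs ", length(b))
}
```

In the question, the same thing happens with 8352 predictions and 33405 training responses, so the differences being plotted do not correspond to matching observations.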

You either need to evaluate the RMSE on the train data, or on the test data, but you are mixing them. That is, either this

RFestimated <- predict(model1, dataTrain)
qqnorm((RFestimated - model1$y)/sd(RFestimated-model1$y))

would work, or this:

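RFestimated <- predict(model1, dataTest)
# the response column is pm2.5 (see the model formula), so compare against dataTest$pm2.5
qqnorm((RFestimated - dataTest$pm2.5)/sd(RFestimated - dataTest$pm2.5))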

The first option tells you how well the model fits the sample it was trained on, and the second gives you its performance on the held-out test data.
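
As a minimal sketch of the two options, the code below computes the RMSE separately on training and test data. It makes no assumptions about beijingData: it uses the built-in mtcars data set, and lm stands in for randomForest so that it runs with base R alone. The rmse helper is a hypothetical name introduced here, not part of any package.

```r
# Hypothetical helper: root mean squared error of two equal-length vectors.
rmse <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted))
  sqrt(mean((actual - predicted)^2))
}

set.seed(100)  # set the seed before sampling so the split is reproducible
train_index <- sample(seq_len(nrow(mtcars)), floor(0.8 * nrow(mtcars)))
dataTrain <- mtcars[train_index, ]
dataTest  <- mtcars[-train_index, ]

# lm is a stand-in here; with randomForest the predict() calls look the same.
fit <- lm(mpg ~ wt + hp, data = dataTrain)

# Compare predictions with the observed response from the SAME data set.
rmse_train <- rmse(dataTrain$mpg, predict(fit, dataTrain))
rmse_test  <- rmse(dataTest$mpg,  predict(fit, dataTest))
```

With the objects from the question, the test-set version would be rmse(dataTest$pm2.5, RFestimated), provided RFestimated comes from predict(model1, dataTest). The length check inside the helper turns a silent recycling mistake into an immediate error.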

coffeinjunky