1

I'm struggling to understand how the predict function works and can be used with different sample data. For instance the following code...

my <- data.frame(x=rnorm(1000))  
my$y <- 0.5*my$x+0.5*rnorm(1000)
fit <- lm(my$y ~ my$x)   
mySample <- my[sample(nrow(my), 100),]    
predict(fit, mySample)

I would expect this to return 100 y predictions based on the sample, but it returns 1,000 rows with the warning message:

'newdata' had 100 rows but variables found have 1000 rows

How do I produce a set of predictions based on a new set of data using predict? Or am I using the wrong function? I am a noob, so I apologise in advance if I am asking stupid questions.

Graeme
  • \~ is just a tilde; the \ was needed to make it display. Is there a tag to quote code without it being messed up? I had to backslash the dollar signs to get them to display normally as well. – Graeme Aug 14 '14 at 23:24
  • This question is off-topic here as it is about using R functions correctly. – Momo Aug 14 '14 at 23:32
  • @user3762838 format as code – Jeromy Anglim Aug 14 '14 at 23:35
  • This appears to be because you ignored the error at the previous step. Try `mySample <- my[sample(seq_along(my$x), 100),]` but also consider whether you really wanted sampling with rather than without replacement. – Glen_b Aug 15 '14 at 00:09

3 Answers

1

It's never a good idea to use the $ symbol when using the formula syntax (and most of the time it's completely unnecessary). This is especially true when you are trying to make predictions, because the predict() function works hard to exactly match up column names and data types. So rather than

fit <- lm(my$y ~ my$x)

use

fit <- lm(y ~ x, my)

So a complete example would be

set.seed(15) # for reproducibility
my <- data.frame(x=rnorm(1000))  
my$y <- 0.5*my$x+0.5*rnorm(1000)
fit <- lm(y ~ x, my)
mySample <- my[sample(1:nrow(my), 100),]    
head(predict(fit, mySample))
#         694         278         298         825         366         980 
#  0.43593108 -0.67936324 -0.42168723 -0.04982095 -0.72499087  0.09627245 
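
To see why the two fits behave differently, here is a minimal sketch that compares the predictor names each model stores. The names `fit_dollar`, `fit_data`, `pred_dollar` and `pred_data` are made up for this illustration, and the warning is suppressed only to keep the output short.

```r
set.seed(15)  # for reproducibility
my <- data.frame(x = rnorm(1000))
my$y <- 0.5 * my$x + 0.5 * rnorm(1000)
mySample <- my[sample(nrow(my), 100), ]

fit_dollar <- lm(my$y ~ my$x)       # predictor is stored under the label "my$x"
fit_data   <- lm(y ~ x, data = my)  # predictor is stored under the label "x"

# Term labels recorded in each fitted model
attr(terms(fit_dollar), "term.labels")  # "my$x"
attr(terms(fit_data), "term.labels")    # "x"

# mySample contains no object called `my`, so my$x is resolved in the
# environment where the formula was created, i.e. the full 1000-row data frame
pred_dollar <- suppressWarnings(predict(fit_dollar, newdata = mySample))
length(pred_dollar)  # 1000

# mySample does contain a column called x, so newdata is used as intended
pred_data <- predict(fit_data, newdata = mySample)
length(pred_data)    # 100
```

The two fits estimate the same coefficients; only the way the variables are looked up at prediction time differs.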
MrFlick
  • This solved the problem. I'm not sure why; I'd have thought they were functionally identical, but predict ignores the sample when I use the first fit statement and returns 1,000 rows, while it returns only 100 rows, as expected, when the second fit statement is used. – Graeme Aug 15 '14 at 09:20
0

A couple of things are wrong with the code. You are overwriting the sample function with your variable named sample; you want something like mysample <- sample(my$x, 100). It has nothing to do with predict. From my limited understanding, data frames are 'lists of columns', so sampling my means creating 100 samples of (the 1000-row) column x. By using my$x you are now referring to the column (in the data frame), which is a list of rows.

In other words, you are sampling from a list of columns (which only has a single element), but you actually want to sample from a list of the rows in column x.
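
The "list of columns" point can be checked directly. This is a rough sketch, assuming the same `my` data frame as in the question (which has two columns, x and y, by the time it is sampled); `res` and `rows` are names invented here. It also shows the row-index approach used elsewhere in this thread, which keeps whole rows rather than a bare vector.

```r
set.seed(15)
my <- data.frame(x = rnorm(1000))
my$y <- 0.5 * my$x + 0.5 * rnorm(1000)

is.list(my)  # TRUE: a data frame is a list whose elements are its columns
length(my)   # 2: the number of columns, not the number of rows
nrow(my)     # 1000

# Sampling the data frame itself draws from its columns, so asking for
# 100 draws without replacement fails; capture the error message
res <- tryCatch(sample(my, 100), error = function(e) conditionMessage(e))

# Sampling row indices and subsetting keeps the data frame structure
rows <- sample(nrow(my), 100)
mySample <- my[rows, ]
dim(mySample)  # 100 2
```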

seanv507
  • You are right, I was being a bit lazy using 'sample' as the sample; I've updated the code to use mySample to differentiate it from the function call. I'm not sure I understand the second part. sample/mySample is a set of 100 rows passed to the predict function. The help doesn't say this, but from what I've read on the internet, this goes to the newdata argument, and predict should give me the predictions based on each row of the data.frame. Unfortunately I seem to be wrong on this. – Graeme Aug 14 '14 at 23:36
  • my is a dataframe = list of columns, my$x is a list of rows. Just try what I suggested. – seanv507 Aug 14 '14 at 23:40
0

Is this what you want?

library(caret)

my <- data.frame(x = rnorm(1000))
my$y <- 0.5 * my$x + 0.5 * rnorm(1000)

## Divide data into train (80%) and test (20%) sets
Index <- createDataPartition(my$y, p = 0.8, list = FALSE, times = 1)
train <- my[Index, ]
test <- my[-Index, ]

## Fit a linear model with cross-validation, then predict on the test rows
lmfit <- train(y ~ x, method = "lm", data = train, trControl = trainControl(method = "cv"))
lmpredict <- predict(lmfit, test)

This is for an in-sample prediction. For a pseudo out-of-sample prediction (forecasting one step ahead), you just need to lag the independent variable by 1:

 Lag(x)
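
For comparison, the same train/test idea can be sketched in base R without caret. This is only an illustration under assumptions: the 80/20 split is drawn with plain sample() rather than createDataPartition, and the names `train_idx`, `pred` and `rmse` are made up here.

```r
set.seed(1)  # for reproducibility
my <- data.frame(x = rnorm(1000))
my$y <- 0.5 * my$x + 0.5 * rnorm(1000)

# Randomly assign 80% of the rows to the training set
train_idx <- sample(nrow(my), size = 0.8 * nrow(my))
train <- my[train_idx, ]
test  <- my[-train_idx, ]

# Fit on the training rows only, using column names in the formula
fit <- lm(y ~ x, data = train)

# One prediction per row of the held-out test set
pred <- predict(fit, newdata = test)
length(pred)

# Root mean squared error on the held-out rows
rmse <- sqrt(mean((test$y - pred)^2))
rmse
```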