R predict.lm function gives output of wrong size.

stocks = read.csv("some-file.csv", header = TRUE)

## 75% of the sample size
smp_size <- floor(0.75 * nrow(stocks))

## set the seed to make your partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(stocks)), size = smp_size)

train <- stocks[train_ind, ]
test <- stocks[-train_ind, ]

model = lm ( train$Open ~ train$Close, data=train)
model
predicted<-predict.lm(model, test$Open)
length(test$Open)
length(predicted)
length(test$Close)

> length(test$Open)
[1] 16994
> length(predicted)
[1] 50867
> length(test$Close)
[1] 16994

Why is this happening? The output length of the predict function should be equal to the length of test$Open, right?

Vishwajeet Vatharkar

2 Answers


I can't say exactly how lm will interpret your train$Open and train$Close, but I can say your data=stocks is your problem: that is where lm is getting its data from, and it is why the result isn't the length of your train set. Use bare column names in the formula and let data supply them. You want model <- lm(Open ~ Close, data=train)
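To make the difference concrete, here is a minimal sketch using the built-in mtcars data as a stand-in (an assumption, since the asker's CSV is not available; mpg and wt play the roles of Open and Close). It suggests that a formula written with train$... is tied to the training vectors, while bare column names let predict pick the columns up from newdata.

```r
# Stand-in data: mtcars replaces the asker's stocks file (assumption).
set.seed(123)
train_ind <- sample(seq_len(nrow(mtcars)), size = 24)
train <- mtcars[train_ind, ]
test  <- mtcars[-train_ind, ]

# Formula hard-codes train$...; predict() cannot swap in the test rows,
# so it warns and returns one value per training row.
bad_model <- lm(train$mpg ~ train$wt, data = train)
pred_bad  <- suppressWarnings(predict(bad_model, newdata = test))

# Formula uses bare column names, looked up in data and later in newdata.
good_model <- lm(mpg ~ wt, data = train)
pred_good  <- predict(good_model, newdata = test)

length(pred_bad)   # same as nrow(train)
length(pred_good)  # same as nrow(test)
```

The only change between the two models is the formula; the data argument is identical in both calls.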

doctorG
  • Changed it so, but the same problem continues to exist. Also it is showing me error `> predicted<-predict.lm(model, newdata=test$Open) Error in eval(predvars, data, env) : numeric 'envir' arg not of length one` – Vishwajeet Vatharkar Jan 14 '16 at 09:48
  • @VishwajeetVatharkar, have you read the help for lm? Why do you keep using ? – doctorG Jan 14 '16 at 13:53

The problem lies in predicted<-predict.lm(model, test$Open); it should be

 predicted<-predict.lm(model, test)

The response is dropped inside predict.lm anyhow, in

 line 15:       Terms <- delete.response(tt)

Actually, it should have been test$Close for your model anyhow, since Close is the predictor.

What you got was the result for the training set, as effectively you weren't providing any data at all (after the code deleted the response). An example using iris:

train_ind <- sample(seq_len(nrow(iris)), size=100)
train <- iris[train_ind,]
test <- iris[-train_ind,]
model <- lm(Sepal.Length ~ Sepal.Width, data=train)
model
# predict from the full test set; the response column is ignored
predicted1 <- predict.lm(model, test)
length(predicted1)
# predict from a data frame that holds only the predictor column
predicted2 <- predict.lm(model, data.frame(Sepal.Width=test$Sepal.Width))
length(predicted2)
predicted1-predicted2

the output of the last few lines

> length(predicted1)
[1] 50
> predicted2 <- predict.lm(model, data.frame(Sepal.Width=test$Sepal.Width))
> length(predicted2)
[1] 50
> predicted1-predicted2
  4   5   9  10  12  17  19  25  26  32  33  36  37  40  41  47  49  53  61  67  68  69  74  76  78  79  81  83  84  85  87 
  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 
 92  94  98 105 110 112 113 114 122 125 127 128 132 133 137 140 141 142 145 
  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 
CAFEBABE
  • It gives warning like `> predicted<-predict.lm(model, test) Warning message: 'newdata' had 16994 rows but variables found have 50980 rows ` And still the problem continues – Vishwajeet Vatharkar Jan 14 '16 at 10:16
  • so, the second argument for `predict.lm` should be test data input (X variable) right? – Vishwajeet Vatharkar Jan 14 '16 at 10:18
  • No, it should be the test set; this will give the same results and avoids mistakes. At minimum you need to add test$Close. Where is your warning from the other comment coming from? – CAFEBABE Jan 14 '16 at 10:23
  • I added a small example where it works using `iris`, as your data is not available. – CAFEBABE Jan 14 '16 at 10:40
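
The row-count warning and the `numeric 'envir'` error quoted in the comments can likely be reproduced on built-in data. The sketch below again uses mtcars as a stand-in (an assumption; mpg and wt replace Open and Close). It captures the warning text from a model whose formula uses train$..., then checks what the second argument of predict needs to be for a model with bare column names: a data frame with the predictor column, not a bare vector.

```r
# Stand-in data: mtcars replaces the asker's stocks file (assumption).
set.seed(123)
train_ind <- sample(seq_len(nrow(mtcars)), size = 24)
train <- mtcars[train_ind, ]
test  <- mtcars[-train_ind, ]

# A formula with train$... triggers the row-count warning on new data.
dollar_model <- lm(train$mpg ~ train$wt, data = train)
warn_msg <- tryCatch(
  { predict(dollar_model, newdata = test); NA_character_ },
  warning = function(w) conditionMessage(w)
)
warn_msg

# With bare column names, newdata must be a data frame, not a vector.
model <- lm(mpg ~ wt, data = train)
vec_result <- tryCatch(predict(model, newdata = test$wt),
                       error = function(e) e)
inherits(vec_result, "error")

# The full test set and a predictor-only data frame give the same values.
from_full <- predict(model, newdata = test)
from_x    <- predict(model, newdata = data.frame(wt = test$wt))
all.equal(unname(from_full), unname(from_x))
```

Passing the whole test data frame is the less error-prone choice, because the column names are then guaranteed to match those used in the formula.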