69

I posted earlier today about an error I was getting with using the predict function. I was able to get that corrected, and thought I was on the right path.

I have a number of observations (actuals) and I have a few data points that I want to extrapolate or predict. I used lm to create a model, then I tried to use predict with the actual value that will serve as the predictor input.

This code is all repeated from my previous post, but here it is:

df <- read.table(text = '
     Quarter Coupon      Total
1   "Dec 06"  25027.072  132450574
2   "Dec 07"  76386.820  194154767
3   "Dec 08"  79622.147  221571135
4   "Dec 09"  74114.416  205880072
5   "Dec 10"  70993.058  188666980
6   "Jun 06"  12048.162  139137919
7   "Jun 07"  46889.369  165276325
8   "Jun 08"  84732.537  207074374
9   "Jun 09"  83240.084  221945162
10  "Jun 10"  81970.143  236954249
11  "Mar 06"   3451.248  116811392
12  "Mar 07"  34201.197  155190418
13  "Mar 08"  73232.900  212492488
14  "Mar 09"  70644.948  203663201
15  "Mar 10"  72314.945  203427892
16  "Mar 11"  88708.663  214061240
17  "Sep 06"  15027.252  121285335
18  "Sep 07"  60228.793  195428991
19  "Sep 08"  85507.062  257651399
20  "Sep 09"  77763.365  215048147
21  "Sep 10"  62259.691  168862119', header=TRUE)

str(df)
'data.frame':   21 obs. of  3 variables:
 $ Quarter   : Factor w/ 24 levels "Dec 06","Dec 07",..: 1 2 3 4 5 7 8 9 10 11 ...
 $ Coupon: num  25027 76387 79622 74114 70993 ...
 $ Total: num  132450574 194154767 221571135 205880072 188666980 ...

Code:

model <- lm(df$Total ~ df$Coupon, data=df)

> model

Call:
lm(formula = df$Total ~ df$Coupon)

Coefficients:
(Intercept)    df$Coupon  
  107286259         1349 

Predict code (based on previous help):

(These are the predictor values I want to use to get the predicted value)

Quarter = c("Jun 11", "Sep 11", "Dec 11")
Total = c(79037022, 83100656, 104299800)
Coupon = data.frame(Quarter, Total)

Coupon$estimate <- predict(model, newdate = Coupon$Total)

Now, when I run that, I get this error message:

Error in `$<-.data.frame`(`*tmp*`, "estimate", value = c(60980.3823396919,  : 
  replacement has 21 rows, data has 3

My original data frame that I used to build the model had 21 observations in it. I am now trying to predict 3 values based on the model.

I either don't truly understand this function, or have an error in my code.

Help would be appreciated.

Thanks

mikebmassey
  • 3
    You almost certainly need to use the `data` argument to `lm` to get this to work, i.e. `model <- lm(Total ~ Coupon, data=df)`. Then I would suggest `Coupon$estimate <- predict(model, newdata = Coupon)$Total` – Ben Bolker Jan 27 '12 at 03:46
  • 2
    @BenBolker I agree on the first part, not so sure about the second. I think `predict(model, newdata = Coupon)` should be what he wants. – joran Jan 27 '12 at 03:50
  • 1
    @joran yes, I think you're right. – Ben Bolker Jan 27 '12 at 03:51
  • @BenBolker & @joran Updated the code to reflect the `data=df` that Ben suggested. Same result. Then I updated it to joran's suggestion. Same error. – mikebmassey Jan 27 '12 at 03:52
  • You didn't update it as Ben indicated. Notice a difference in your formula specifications? `df$Total` versus just `Total`. Your way, when you use `predict`, it's looking for a variable named `df$Coupon` rather than just `Coupon` (I think). At the very least, the names don't match up. – joran Jan 27 '12 at 04:01
  • @mikebmassey -- Also, go have another look at my answer to your question. I had given you incorrect info w/ my first answer, but updated it several hours ago. I think the answer is now pretty good, and makes the additional suggestion (in accord with `?predict.lm`) that `newdata` should be a **data.frame** containing the `Coupon` or any other predictor variables. Sorry -- thought SO would automatically notify you of the change to my answer and addition of a comment. – Josh O'Brien Jan 27 '12 at 04:04

4 Answers

105

First, you want to use

model <- lm(Total ~ Coupon, data=df)

not model <-lm(df$Total ~ df$Coupon, data=df).
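A small sketch of why the formula style matters, using only the first five rows of the question's data. The new Coupon values are made up purely for illustration, and the names bad and good are placeholders:

```r
# First five rows of the question's data, for illustration only
df <- data.frame(
  Coupon = c(25027.072, 76386.820, 79622.147, 74114.416, 70993.058),
  Total  = c(132450574, 194154767, 221571135, 205880072, 188666980)
)

bad  <- lm(df$Total ~ df$Coupon)       # predictor is stored as "df$Coupon"
good <- lm(Total ~ Coupon, data = df)  # predictor is stored as "Coupon"

names(coef(bad))   # "(Intercept)" "df$Coupon"
names(coef(good))  # "(Intercept)" "Coupon"

# Hypothetical new predictor values (not from the question)
new.df <- data.frame(Coupon = c(50000, 60000))

# For the bad model, predict() evaluates df$Coupon again, which ignores
# new.df and picks up the original df, so one value per original row
# comes back (R also warns about the row mismatch).
n_bad  <- length(suppressWarnings(predict(bad, new.df)))
n_good <- length(predict(good, new.df))
c(n_bad, n_good)   # 5 and 2
```

With the full 21-row data frame, the first call would return 21 values, which is where the "replacement has 21 rows, data has 3" error comes from.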

Second, by saying lm(Total ~ Coupon), you are fitting a model that uses Total as the response variable, with Coupon as the predictor. That is, your model is of the form Total = a + b*Coupon, with a and b the coefficients to be estimated. Note that the response goes on the left side of the ~, and the predictor(s) on the right.

Because of this, when you ask R to give you predicted values for the model, you have to provide a set of new predictor values, i.e. new values of Coupon, not Total.

Third, judging by your specification of newdata, it looks like you're actually after a model to fit Coupon as a function of Total, not the other way around. To do this:

model <- lm(Coupon ~ Total, data=df)
new.df <- data.frame(Total=c(79037022, 83100656, 104299800))
predict(model, new.df)
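For completeness, here is a self-contained sketch of the same steps that also stores the predictions as a column, which is what the question's Coupon$estimate line was aiming for. The data are the question's values typed in as vectors, and the estimate column name is taken from the question:

```r
# Data from the question, entered as plain vectors
df <- data.frame(
  Coupon = c(25027.072, 76386.820, 79622.147, 74114.416, 70993.058,
             12048.162, 46889.369, 84732.537, 83240.084, 81970.143,
             3451.248, 34201.197, 73232.900, 70644.948, 72314.945,
             88708.663, 15027.252, 60228.793, 85507.062, 77763.365,
             62259.691),
  Total  = c(132450574, 194154767, 221571135, 205880072, 188666980,
             139137919, 165276325, 207074374, 221945162, 236954249,
             116811392, 155190418, 212492488, 203663201, 203427892,
             214061240, 121285335, 195428991, 257651399, 215048147,
             168862119)
)

# Response (Coupon) on the left, predictor (Total) on the right
model <- lm(Coupon ~ Total, data = df)

# New data: the predictor column must be called Total, as in the formula.
# Extra columns such as Quarter are simply ignored by predict().
new.df <- data.frame(
  Quarter = c("Jun 11", "Sep 11", "Dec 11"),
  Total   = c(79037022, 83100656, 104299800)
)

# One prediction per row of new.df, so the column assignment lines up
new.df$estimate <- predict(model, newdata = new.df)
nrow(new.df)  # 3
```

Note that all three new Total values lie below the smallest Total in the original data (116811392), so these are extrapolations and should be read with that in mind.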
Hong Ooi
  • I think you've got the formula backwards. Also, `new.df` should contain `Coupon` instead of `Total`. Also, my answer to the original question works as well ;) – Josh O'Brien Jan 27 '12 at 04:12
  • 1
    @JoshO'Brien: I'm going off the newdata that the OP posted, which specifies values of `Total`. That would imply that he's actually after a model to predict `Coupon`. – Hong Ooi Jan 27 '12 at 04:15
  • But he always put `Total` on the LHS of the formula, as do you in the opening line of your post! Unless I'm unbelievably confused, `Coupon` is meant to be the predictor. (Not that it matters near as much as the concepts you're trying to get across). – Josh O'Brien Jan 27 '12 at 04:22
  • I suspect the OP may be confused about which side of the `~` the response variable is supposed to be on. I'll update my answer. – Hong Ooi Jan 27 '12 at 04:26
  • Thanks for the help on this guys. To Josh's point - I am trying to predict `Coupon`, not `Total`, so apologies if I am confusing everyone. I did as Hong laid out and did get it to work. Thanks for that. However, when I run `predict(model, new.df)`, I still get 21 observations instead of the 3 I was trying to determine in `new.df`. The whole point of `predict` is to use `lm` and predict new values, right, or am I just confused on its function? Thanks again. – mikebmassey Jan 27 '12 at 04:27
  • 1
    @mikebmassey: check my answer again, I've just edited it. Make sure you have `Coupon` on the LHS of the formula, and you've entered your code exactly as I've got it in the last 3 lines of my answer. – Hong Ooi Jan 27 '12 at 04:31
11

Thanks Hong, that was exactly the problem I was running into. The error you get suggests that the number of rows is wrong, but the problem is actually that the model has been trained using a command that ends up with the wrong names for parameters.

This is a critical detail that is entirely non-obvious for lm and similar functions. Some tutorials use lines like lm(olive$Area@olive$Palmitic), which end up with variable names of olive$Area, NOT Area, so a data frame created with anewdata<-data.frame(Palmitic=2) can't then be used for prediction. If you use lm(Area@Palmitic,data=olive) then the variable names are right and prediction works.

The real problem is that the error message does not indicate the problem at all:

Warning message: 'anewdata' had 1 rows but variable(s) found to have X rows
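Since the warning is so unhelpful, one option is to check the names yourself before predicting. The sketch below assumes a small hypothetical helper, check_newdata, which is not part of R; it compares the variables in the model's formula with the columns of the new data. The data are the first five rows from the question.

```r
# Hypothetical helper (not a base R function): TRUE when every variable
# on the right-hand side of the model formula is a column of newdata
check_newdata <- function(model, newdata) {
  needed <- all.vars(delete.response(terms(model)))
  all(needed %in% names(newdata))
}

# First five rows of the question's data, for illustration only
df <- data.frame(
  Coupon = c(25027.072, 76386.820, 79622.147, 74114.416, 70993.058),
  Total  = c(132450574, 194154767, 221571135, 205880072, 188666980)
)
new.df <- data.frame(Total = c(79037022, 83100656, 104299800))

good <- lm(Coupon ~ Total, data = df)
bad  <- lm(df$Coupon ~ df$Total)

ok_good <- check_newdata(good, new.df)  # TRUE: "Total" is a column of new.df
ok_bad  <- check_newdata(bad, new.df)   # FALSE: the formula needs "df" itself
```

The check is deliberately crude: it only looks at names, not at types or factor levels.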

David Burton
  • Thanks, this is a very important point, I got the error you mentioned as well. To apply your answer to Hong's response: if the column in his new.df was not named "Total", which is the same column name as the original data frame, then he would get the error that you (and I) got. So it's important to make sure the column name in your newdata is the same as the predictor in your original model. – NeonBlueHair Nov 08 '14 at 21:47
  • Flagging this as not an answer. The use of the `@`-operator indicates you were dealing with an S4-object and that has nothing to do with the original question nor the answer. You have incorrectly confused your difficulties with an unspecified homework problem with a more simple problem that was adequately answered. – IRTFM Dec 01 '16 at 02:31
5

To avoid the error, the key point about the new dataset is the name of the independent variable: it must be the same as the name used in the model. Another way is to nest the two function calls, without storing the new dataset in a variable first:

model <- lm(Coupon ~ Total, data=df)
predict(model, data.frame(Total=c(79037022, 83100656, 104299800)))

Pay attention to how the model is specified. The next two commands look similar, but with the predict function the first works and the second doesn't.

model <- lm(Coupon ~ Total, data=df) # OK
model <- lm(df$Coupon ~ df$Total)    # not OK: predict() ignores newdata
Alessio
4

You are using newdate instead of newdata in your predict code; check the spelling. Then just use Coupon$estimate <- predict(model, Coupon) and it will work.
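A short sketch of what the misspelling does, using the first five rows of the question's data and a model fitted as Coupon ~ Total. The names typo and right are just placeholders. Because predict accepts extra arguments through `...`, a misspelled newdate is silently swallowed and no new data is used at all:

```r
# First five rows of the question's data, for illustration only
df <- data.frame(
  Coupon = c(25027.072, 76386.820, 79622.147, 74114.416, 70993.058),
  Total  = c(132450574, 194154767, 221571135, 205880072, 188666980)
)
model  <- lm(Coupon ~ Total, data = df)
new.df <- data.frame(Total = c(79037022, 83100656, 104299800))

# Misspelled argument: absorbed by '...', so newdata stays missing and
# predict() returns the fitted values for the original rows
typo  <- predict(model, newdate = new.df)

# Correct spelling: one prediction per row of new.df
right <- predict(model, newdata = new.df)

length(typo)   # 5, same as nrow(df)
length(right)  # 3, same as nrow(new.df)
```

No error or warning is raised for the typo, so the row-count mismatch only shows up later, at the assignment.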

sumalatha