I was fitting a linear model for my assignment:

lm(revenue ~ (max_cpc - max_cpc.mean), data = traffic)  

But it throws:

Error in model.frame.default(formula = revenue ~ (max_cpc - max_cpc.mean),  : 
   variable lengths differ (found for 'max_cpc.mean')

Then, through trial and error, I slightly modified my code:

lm(revenue ~ I(max_cpc - max_cpc.mean), data = traffic)

and bingo, it worked.

Now I am trying to understand what I() does and how it fixed my problem. Can anyone explain it to me?

heybhai
  • This question appears to be off-topic because it does not show any effort to solve. – Carl Witthoft Jun 13 '14 at 11:45
  • Hi @CarlWitthoft, please let me know what is so wrong with this question that it needs to be downvoted or closed. And if it seems so easy to solve, why didn't you answer it instead of commenting on it? – heybhai Jun 13 '14 at 12:43
  • @CarlWitthoft Please give me a link to the question you say is the original, of which mine is a duplicate. Thanks. – heybhai Jun 13 '14 at 13:32
  • dupe: http://stackoverflow.com/questions/24192428/capital-letter-i-in-r-linear-regression – Carl Witthoft Jun 13 '14 at 13:35

1 Answer

I() stops the formula interface from treating its argument as formula syntax; the expression inside is evaluated as ordinary R arithmetic instead.

In the formula interface -x means 'remove x from the predictors'. So I can do y~.-x to mean 'fit y against everything but x'.
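
As a small illustration of that removal behaviour, here is a sketch on made-up data (the data frame d and its columns y, x and z are assumptions for the example, not part of the question):

```r
# Toy data; the names d, y, x and z are hypothetical.
set.seed(1)
d <- data.frame(y = rnorm(20), x = rnorm(20), z = rnorm(20))

# "." expands to every column other than the response; "- x" then drops x.
fit_drop <- lm(y ~ . - x, data = d)
fit_z    <- lm(y ~ z, data = d)

# Both fits contain only the intercept and z.
print(names(coef(fit_drop)))
print(all.equal(coef(fit_drop), coef(fit_z)))
```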

You don't want it to do that - you actually want to make a variable that is the difference of two variables and regress on that, so you don't want the formula interface to parse that expression.

I() achieves that for you.
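
To make this concrete, here is a rough reproduction of the situation in the question. The traffic data below is simulated, and it assumes max_cpc.mean is a length-one value stored outside the data frame, as the comments suggest:

```r
# Simulated stand-in for the asker's data; the real values are unknown.
set.seed(42)
traffic <- data.frame(max_cpc = runif(30, 0.1, 2))
traffic$revenue <- 5 + 3 * traffic$max_cpc + rnorm(30)
max_cpc.mean <- mean(traffic$max_cpc)  # length 1, not a column of traffic

# Without I(): "-" is formula syntax, yet the model frame still collects
# max_cpc.mean, whose length (1) does not match the 30 rows.
res <- try(lm(revenue ~ (max_cpc - max_cpc.mean), data = traffic), silent = TRUE)
print(inherits(res, "try-error"))

# With I(): the subtraction is ordinary arithmetic, the single mean is
# recycled across all rows, and the result is one centered predictor.
fit <- lm(revenue ~ I(max_cpc - max_cpc.mean), data = traffic)
print(names(coef(fit)))

# Centering shifts the intercept but leaves the slope unchanged.
fit_raw <- lm(revenue ~ max_cpc, data = traffic)
print(all.equal(unname(coef(fit)[2]), unname(coef(fit_raw)[2])))
```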

Terms with squaring in them (x^2) also need the same treatment. In a formula, ^ denotes crossing of terms rather than an arithmetic power, so if you actually want a variable squared you have to wrap it in I().
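
A quick way to see the difference is to inspect the terms a formula expands to. The sketch below uses hypothetical names (d, x, y, a, b) and simulated data:

```r
# In a formula, x^2 is "x crossed with itself", which collapses to x.
labels_plain <- attr(terms(y ~ x^2), "term.labels")
labels_asis  <- attr(terms(y ~ I(x^2)), "term.labels")
labels_cross <- attr(terms(y ~ (a + b)^2), "term.labels")

print(labels_plain)  # just x
print(labels_asis)   # the literal squared term
print(labels_cross)  # main effects plus the a:b interaction

# Fitting an actual quadratic on simulated data.
set.seed(7)
d <- data.frame(x = seq(-2, 2, length.out = 25))
d$y <- 1 + d$x^2 + rnorm(25, sd = 0.1)
fit_quad <- lm(y ~ x + I(x^2), data = d)
print(names(coef(fit_quad)))
```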

I() has some other uses in other contexts as well. See ?I
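
One such use, sketched here with made-up column names, is inside data.frame(), where I() keeps an object "as is" so that a list can be stored as a single column:

```r
# The names df, id and values are hypothetical.
df <- data.frame(id = 1:2, values = I(list(1:3, c("a", "b"))))

print(inherits(df$values, "AsIs"))  # the column keeps the AsIs class
print(nrow(df))                     # one row per list element
print(length(df$values[[1]]))       # the first cell holds a length-3 vector
```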

Glen_b
  • Also, when you apply certain operators to two vectors of different lengths, the shorter vector is recycled: its values are repeated until it is the same length as the longer vector. The "variable lengths differ" error indicates that max_cpc.mean is shorter than max_cpc, and wrapping the expression in I() causes R to compute the difference while recycling the shorter vector. – neverKnowsBest Jun 13 '14 at 04:31
  • Further to the variable lengths issue, it also suggests that at least one of the variables was not in the `data.frame` `traffic`. You might want to double check that that is indeed what you wanted. – John Jun 13 '14 at 04:40
  • @John presumably `max_cpc.mean` is the mean of `max_cpc` and is therefore probably of length 1 – Glen_b Jun 13 '14 at 04:42
  • Glen_b, perhaps, or perhaps it was intended as a vector of conditional means sorted on some other variable, or maybe it's a typo, or maybe it's... – John Jun 13 '14 at 04:58
  • @John max_cpc.mean is the mean of max_cpc; it's not a typo. – heybhai Jun 13 '14 at 05:54
  • @neverKnowsBest can you please explain how R recycles vectors of different lengths? – heybhai Jun 13 '14 at 05:57
  • heybhai, you should ask a new question for that. – John Jun 13 '14 at 06:27
  • Sure @John I hope you would be there to help me out. – heybhai Jun 13 '14 at 12:40
  • Is using I() exactly the same as computing the operation beforehand? Can I() be used for the predicted variable too? – skan Jun 06 '18 at 11:42
  • Generally it will be the same but it may be possible to contrive a situation where something will be changed in between. It can be used on the left side of the formula as well. – Glen_b Jun 06 '18 at 15:20
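
For the last two comments, a minimal check on simulated data (the names d, x, z, y and diff are assumptions for the example) suggests that a precomputed column and an I() term give the same fit, and that I() is accepted on the response side:

```r
# Hypothetical data for illustration only.
set.seed(3)
d <- data.frame(x = rnorm(40), z = rnorm(40))
d$y <- 2 + 0.5 * (d$x - d$z) + rnorm(40, sd = 0.2)

# Precompute the difference as its own column...
d$diff <- d$x - d$z
fit_pre <- lm(y ~ diff, data = d)

# ...or let I() compute it inside the formula.
fit_I <- lm(y ~ I(x - z), data = d)
print(all.equal(unname(coef(fit_pre)), unname(coef(fit_I))))

# I() on the left-hand side: model the difference y - x as the response.
fit_lhs <- lm(I(y - x) ~ z, data = d)
print(names(coef(fit_lhs)))
```

One practical difference: with I(), the expression is re-evaluated on whatever newdata is passed to predict(), whereas a precomputed column has to be rebuilt by hand.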