I am getting the following error when calculating VIF on a small dataset in RStudio. Could anyone help? I can provide more information on the dataset if needed.

"Error in as.vector(y) - mean(y) non-numeric argument to binary operator".

Dataset: 80 obs. and 15 variables (all variables are numeric)

Steps Followed:

    # 1. Determine correlation
    library(corrplot)
    cor.data <- cor(train)
    corrplot(cor.data, method = 'color')
    cor.data

    # 2. Build model
    model2 <- lm(Volume ~ ., train)
    summary(model2)

    # 3. Calculate VIF
    library(VIF)
    vif(model2)

Here is a sample dataset with 20 obs.

train <- structure(list(Price = c(949, 2249.99, 399, 409.99, 1079.99, 
114.22, 379.99, 65.29, 119.99, 16.99, 6.55, 15, 52.5, 21.08, 
18.98, 3.6, 3.6, 174.99, 9.99, 670), X.5.Star.Reviews. = c(3, 
2, 3, 49, 58, 83, 11, 33, 16, 10, 21, 75, 10, 313, 349, 8, 11, 
170, 15, 20), X.4.Star.Reviews. = c(3, 1, 0, 19, 31, 30, 3, 19, 
9, 1, 2, 25, 8, 62, 118, 6, 5, 100, 12, 2), X.3.Star.Reviews. = c(2, 
0, 0, 8, 11, 10, 0, 12, 2, 1, 2, 6, 5, 13, 27, 3, 2, 23, 4, 4
), X.2.Star.Reviews. = c(0, 0, 0, 3, 7, 9, 0, 5, 0, 0, 4, 3, 
0, 8, 7, 2, 2, 20, 0, 2), X.1.Star.Reviews. = c(0, 0, 0, 9, 36, 
40, 1, 9, 2, 0, 15, 3, 1, 16, 5, 1, 1, 20, 4, 4), X.Positive.Service.Review.   = c(2, 
1, 1, 7, 7, 12, 3, 5, 2, 2, 2, 9, 2, 44, 57, 0, 0, 310, 3, 4), 
    X.Negative.Service.Review. = c(0, 0, 0, 8, 20, 5, 0, 3, 1, 
    0, 1, 2, 0, 3, 3, 0, 0, 6, 1, 3), X.Would.consumer.recommend.product. = c(0.9, 
    0.9, 0.9, 0.8, 0.7, 0.3, 0.9, 0.7, 0.8, 0.9, 0.5, 0.2, 0.8, 
    0.9, 0.9, 0.8, 0.8, 0.8, 0.8, 0.7), X.Shipping.Weight..lbs.. = c(25.8, 
    50, 17.4, 5.7, 7, 1.6, 7.3, 12, 1.8, 0.75, 1, 2.2, 1.1, 0.35, 
    0.6, 0.01, 0.01, 1.4, 0.4, 0.25), X.Product.Depth. = c(23.94, 
    35, 10.5, 15, 12.9, 5.8, 6.7, 7.9, 10.6, 10.7, 7.3, 21.3, 
    15.6, 5.7, 1.7, 11.5, 11.5, 13.8, 11.1, 5.8), X.Product.Width. = c(6.62, 
    31.75, 8.3, 9.9, 0.3, 4, 10.3, 6.7, 9.4, 13.1, 7, 1.8, 3, 
    3.5, 13.5, 8.5, 8.5, 8.2, 7.6, 1.4), X.Product.Height. = c(16.89, 
    19, 10.2, 1.3, 8.9, 1, 11.5, 2.2, 4.7, 0.6, 1.6, 7.8, 15, 
    8.3, 10.2, 0.4, 0.4, 0.4, 0.5, 7.8), X.Profit.margin. = c(0.15, 
    0.25, 0.08, 0.08, 0.09, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 
    0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.15), Volume = c(12, 
    8, 12, 196, 232, 332, 44, 132, 64, 40, 84, 300, 40, 1252, 
    1396, 32, 44, 680, 60, 80)), .Names = c("Price", "X.5.Star.Reviews.", 
"X.4.Star.Reviews.", "X.3.Star.Reviews.", "X.2.Star.Reviews.", 
"X.1.Star.Reviews.", "X.Positive.Service.Review.", "X.Negative.Service.Review.", 
"X.Would.consumer.recommend.product.", "X.Shipping.Weight..lbs..", 
"X.Product.Depth.", "X.Product.Width.", "X.Product.Height.", 
"X.Profit.margin.", "Volume"), row.names = c(NA, 20L), class = "data.frame")
Mavs18
  • You're gonna have to provide a reproducible example, as we'll be guessing which type of variables you're using. A good guide for reproducible examples is this [one](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – cimentadaj Oct 21 '16 at 02:37
  • I just reread that all variables are numeric. But still, we won't be able to figure out the problem without a glimpse of your data/toy dataset. – cimentadaj Oct 21 '16 at 02:53
  • I am new to using stack overflow and have tried to update the question to include some dataset. Does this help at all? – Mavs18 Oct 21 '16 at 02:54
  • Not much. If you want to show us the data you've posted try `dput(data)` and copy that output instead of the text you posted. Given that you're new, you should definitely read how to make a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example?noredirect=1&lq=1) – cimentadaj Oct 21 '16 at 02:57
  • Thanks a lot for the reference. It did help and I have updated the question with output from dput(). Hope this works better....I did subset to 20 observations but I can put the complete dataset here if needed. – Mavs18 Oct 22 '16 at 04:52

3 Answers

The `vif` function from the VIF package does not estimate the Variance Inflation Factor (VIF). According to the package description, "It selects variables for a linear model" and "returns a subset of variables for building a linear model."

What you want is the vif function from the car package.

    install.packages("car")
    library(car)
    vif(model2) # This should do it
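For a model whose predictors are all plain numeric columns, the number reported per predictor is the textbook VIF, 1 / (1 - R²), where R² comes from regressing that predictor on all the others. A minimal base-R sketch of that definition follows; it uses the built-in `mtcars` data and an arbitrary choice of columns purely as a stand-in, not the data from the question, and `manual_vif` is just an illustrative name.

```r
# Illustrative predictors from the built-in mtcars data (not the question's data).
predictors <- mtcars[, c("disp", "hp", "wt", "drat")]

manual_vif <- sapply(names(predictors), function(p) {
  # Auxiliary regression: predictor p explained by all the other predictors
  others <- setdiff(names(predictors), p)
  fit <- lm(reformulate(others, response = p), data = predictors)
  # VIF = 1 / (1 - R^2) of the auxiliary regression
  1 / (1 - summary(fit)$r.squared)
})

manual_vif
```

Predictors with a large value here are the ones that the other predictors explain almost completely, which is what you would look at when deciding which variables to drop.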

Edit: I won't comment specifically on the statistics side, but it seems like you have a perfect fit, something quite unusual, suggesting some problem in your data.
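One possible source of the perfect fit, at least in the 20-row sample posted in the question: Volume equals exactly four times X.5.Star.Reviews. in every row. A minimal check, with the two columns copied from that sample (the vector names are illustrative, and the full 80-row dataset may behave differently):

```r
# Columns copied from the 20-row sample in the question.
five_star <- c(3, 2, 3, 49, 58, 83, 11, 33, 16, 10, 21, 75, 10, 313, 349,
               8, 11, 170, 15, 20)
volume <- c(12, 8, 12, 196, 232, 332, 44, 132, 64, 40, 84, 300, 40, 1252,
            1396, 32, 44, 680, 60, 80)

# Is Volume an exact multiple (4x) of the 5-star review count in every row?
all(volume == 4 * five_star)

# Correlation between the two columns
cor(volume, five_star)
```

If the same relationship holds in the full data, a predictor that determines the response exactly would explain why `lm` fits perfectly.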

cimentadaj
  • Your solution worked, thanks for all your help! Yes, the idea of this exercise was to understand the concept of multicollinearity, removing variables and then improving model performance. – Mavs18 Oct 23 '16 at 06:42

You're giving vif the wrong input. It wants the response y and predictor variables x:

    vif(train$Volume, subset(train, select = -Volume), subsize = 19)

I had to set the subsize argument to a value <= the number of observations (the default is 200).

Ben Bolker
  • Thanks for your reply Ben! I did try the code, and the output I am getting is probably the VIFs of every observation in the dataset. I also do not see all predictors in the output. I just saw 1 Star Review and one other predictor. – Mavs18 Oct 22 '16 at 04:54
  • What I am hoping to get is a vif of every predictor so I can remove the ones with high vif and run the model again to check performance. I am new to R and am revisiting my knowledge of statistics as well. Not sure if vif is the best way to go about it but I still want to try to run vif and see if I can improve my results. – Mavs18 Oct 22 '16 at 05:02

There are two R packages, "car" and "VIF", that each define a function named vif(), and the two definitions differ. Your result or error depends on which package is attached in your current session (if both are, the one attached last masks the other).

If you attach the "VIF" package in the session and pass the linear model to vif(), you get the error from the original question, as shown below:

    > model1 = lm(Satisfaction~., data1)
    > library(VIF)

    Attaching package: ‘VIF’

    The following object is masked from ‘package:car’:

        vif

    > vif(model1)
    Error in as.vector(y) - mean(y) : non-numeric argument to binary operator
    In addition: Warning message:
    In mean.default(y) : argument is not numeric or logical: returning NA

If you load the "car" package in the R session and not "VIF", you get the VIF values expected for a linear model, as shown below:

    > model1 = lm(Satisfaction~., data1)
    > library(car)
    Loading required package: carData

    Attaching package: ‘car’

    The following object is masked from ‘package:psych’:

        logit

    > vif(model1)
       ProdQual        Ecom     TechSup     CompRes Advertising    ProdLine SalesFImage  ComPricing 
       1.635797    2.756694    2.976796    4.730448    1.508933    3.488185    3.439420    1.635000 
     WartyClaim  OrdBilling    DelSpeed 
       3.198337    2.902999    6.516014 

All the columns in data1 are numeric. Hope that helps.
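If both packages need to stay attached, one way to sidestep the masking is to call the function through its namespace. The sketch below assumes the car package is installed and uses the built-in `mtcars` data as a stand-in for data1; `fit` and `vif_values` are illustrative names.

```r
# Assumes car is installed; mtcars is only a stand-in dataset.
if (requireNamespace("car", quietly = TRUE)) {
  fit <- lm(mpg ~ disp + hp + wt, data = mtcars)
  # The car:: prefix selects car's vif() regardless of what is attached
  vif_values <- car::vif(fit)
  print(vif_values)
}

# Lists the attached packages that define vif(); when several are listed,
# the first entry is the one an unqualified vif() call resolves to.
find("vif")
```

Checking `find("vif")` before calling an unqualified `vif()` is a quick way to tell which of the two definitions you are about to run.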

Amit Jain