
I'm working with the following data (in RStudio version 1.3.1056):

https://archive.ics.uci.edu/ml/machine-learning-databases/00397/LasVegasTripAdvisorReviews-Dataset.csv

My goal is to fit a multiple linear regression model using the caret library, so I do the following:

vegas<- read.csv("LasVegasTripAdvisorReviews-Dataset.csv",
                       sep=";", header=T,stringsAsFactors = T)
 
head(vegas)
dim(vegas)
# keep columns 1-4 and 6-20 as predictors and put Score last (no attach() needed)
vegas.data <- cbind(vegas[,c(1:4)], vegas[,c(6:20)], Score = vegas$Score)
head(vegas.data)
dim(vegas.data)
 
#missing values
library(mice)
md.pattern(vegas.data,plot=F)
 
#---------- Multiple Linear Regression  ---------------#
 
library(caret)
vegas.lm <- train(Score ~ ., data = vegas.data, method = "lm")
warnings()
 
summary(vegas.lm)

After running the line that fits vegas.lm, I get this in the console:

> vegas.lm <- train(Score ~ ., data = vegas.data, method = "lm")
There were 25 warnings (use warnings() to see them)

And when I check what the warnings are, R tells me:

> warnings()
Warning messages:
1: In predict.lm(modelFit, newdata) :
prediction from a rank-deficient fit may be misleading
2: In predict.lm(modelFit, newdata) :
prediction from a rank-deficient fit may be misleading
3: In predict.lm(modelFit, newdata) :
prediction from a rank-deficient fit may be misleading
4: In predict.lm(modelFit, newdata) :
prediction from a rank-deficient fit may be misleading
5:
6:
...

I hope you can help me decipher or explain why I get these warnings. Thank you in advance for your attention and support. Thanks a lot.

phiver
    Some googling: https://stackoverflow.com/a/26560328/5221626 https://stats.stackexchange.com/questions/438126/r-help-prediction-from-a-rank-deficient-fit-may-be-misleading – Phil Aug 01 '20 at 20:03
    See https://stackoverflow.com/a/30911235/7023826 for a way to deal with collinear predictors which might be causing the rank issue. Basically, the matrix can't be inverted to solve for the least squares estimates, thus no unique solution to least squares estimates. That's what it is telling you – MDEWITT Aug 01 '20 at 21:23

1 Answer


What you have is rank deficiency, meaning the data do not contain enough information to estimate every effect (coefficient) in the model. This is different from multicollinearity, where the predictors are correlated: as long as two variables are not perfectly correlated, both coefficients can still be estimated, though imprecisely.
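
To make the distinction concrete, here is a minimal sketch on made-up data (the `toy` data frame and its columns are hypothetical and not part of the Las Vegas data set). `x2` is only strongly correlated with `x1`, while `x3` is an exact linear function of `x1`, so only the model containing `x3` should come out rank deficient.

```r
set.seed(1)
n  <- 50
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)  # strongly correlated with x1, but not perfectly
x3 <- 2 * x1 + 1               # exact linear combination of x1 and the intercept
y  <- 1 + x1 + rnorm(n)
toy <- data.frame(y, x1, x2, x3)  # hypothetical data, for illustration only

fit_collinear <- lm(y ~ x1 + x2, data = toy)  # correlated predictors
fit_deficient <- lm(y ~ x1 + x3, data = toy)  # redundant predictor

coef(fit_collinear)  # every coefficient gets an estimate
coef(fit_deficient)  # the coefficient for x3 is NA

# The fitted rank is smaller than the number of coefficients requested
fit_deficient$rank
length(coef(fit_deficient))
```

In the first fit the estimates are unstable but exist; in the second, `lm()` drops `x3` because it adds no information beyond `x1` and the intercept.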

Below is an example from your data of what insufficient information means. Suppose we regress the score on hotel name and hotel stars:

coefficients(lm(Score ~ Hotel.name+Hotel.stars,data=vegas.data))
                                                  (Intercept) 
                                                 4.208333e+00 
                                     Hotel.nameCaesars Palace 
                                                -8.333333e-02 
             Hotel.nameCircus Circus Hotel & Casino Las Vegas 
                                                -1.000000e+00 
                           Hotel.nameEncore at wynn Las Vegas 
                                                 3.333333e-01 
                           Hotel.nameExcalibur Hotel & Casino 
                                                -5.000000e-01 
             Hotel.nameHilton Grand Vacations at the Flamingo 
                                                -2.500000e-01 
            Hotel.nameHilton Grand Vacations on the Boulevard 
                                                -4.166667e-02 
                           Hotel.nameMarriott's Grand Chateau 
                                                 3.333333e-01 
                          Hotel.nameMonte Carlo Resort&Casino 
                                                -9.166667e-01 
                                    Hotel.nameParis Las Vegas 
                                                -1.666667e-01 
                         Hotel.nameThe Cosmopolitan Las Vegas 
                                                 4.166667e-02 
                                       Hotel.nameThe Cromwell 
                                                -1.250000e-01 
                    Hotel.nameThe Palazzo Resort Hotel Casino 
                                                 1.666667e-01 
                       Hotel.nameThe Venetian Las Vegas Hotel 
                                                 3.750000e-01 
            Hotel.nameThe Westin las Vegas Hotel Casino & Spa 
                                                -2.916667e-01 
                 Hotel.nameTreasure Island- TI Hotel & Casino 
                                                -2.500000e-01 
Hotel.nameTropicana Las Vegas - A Double Tree by Hilton Hotel 
                                                -1.666667e-01 
                Hotel.nameTrump International Hotel Las Vegas 
                                                 1.666667e-01 
                  Hotel.nameTuscany Las Vegas Suites & Casino 
                                                 3.657007e-15 
                               Hotel.nameWyndham Grand Desert 
                                                 1.666667e-01 
                                     Hotel.nameWynn Las Vegas 
                                                 4.166667e-01 
                                               Hotel.stars3,5 
                                                           NA 
                                                 Hotel.stars4 
                                                           NA 
                                               Hotel.stars4,5 
                                                           NA 
                                                 Hotel.stars5 
                                                           NA 

You can see that the coefficients for stars are all NA, meaning they cannot be estimated. Why is this so? To separate the effect of the star rating from the effect of the hotel, we would need the same hotel to appear with more than one star rating, for example as both a 3-star and a 4-star hotel. In reality this is impossible, and we can see this if we tabulate the two factors:

table(vegas.data$Hotel.name,vegas.data$Hotel.stars)
                                                     
                                                       3 3,5  4 4,5  5
  Bellagio Las Vegas                                   0   0  0   0 24
  Caesars Palace                                       0   0  0   0 24
  Circus Circus Hotel & Casino Las Vegas              24   0  0   0  0
  Encore at wynn Las Vegas                             0   0  0   0 24
  Excalibur Hotel & Casino                            24   0  0   0  0
  Hilton Grand Vacations at the Flamingo              24   0  0   0  0
  Hilton Grand Vacations on the Boulevard              0  24  0   0  0
  Marriott's Grand Chateau                             0  24  0   0  0
  Monte Carlo Resort&Casino                            0   0 24   0  0
  Paris Las Vegas                                      0   0 24   0  0
  The Cosmopolitan Las Vegas                           0   0  0   0 24
  The Cromwell                                         0   0  0  24  0
  The Palazzo Resort Hotel Casino                      0   0  0   0 24
  The Venetian Las Vegas Hotel                         0   0  0   0 24
  The Westin las Vegas Hotel Casino & Spa              0   0 24   0  0
  Treasure Island- TI Hotel & Casino                   0   0 24   0  0
  Tropicana Las Vegas - A Double Tree by Hilton Hotel  0   0 24   0  0
  Trump International Hotel Las Vegas                  0   0  0   0 24
  Tuscany Las Vegas Suites & Casino                   24   0  0   0  0
  Wyndham Grand Desert                                 0  24  0   0  0
  Wynn Las Vegas                                       0   0  0   0 24

So you can see that each hotel name has exactly one star rating. If you want to fit the regression, you can include either hotel name or hotel stars, but not both.
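
The same situation can be reproduced on a small scale. The sketch below uses a made-up data frame (`toy`, with hypothetical columns `hotel`, `stars` and `score`) in which every hotel has exactly one star rating, then shows which coefficients `lm()` cannot estimate and that dropping either factor resolves the problem.

```r
# Hypothetical toy data: `stars` is fully determined by `hotel`
toy <- data.frame(
  hotel = factor(rep(c("A", "B", "C", "D"), each = 6)),
  stars = factor(rep(c("3", "3", "4", "5"), each = 6))
)
set.seed(42)
toy$score <- sample(1:5, nrow(toy), replace = TRUE)

fit <- lm(score ~ hotel + stars, data = toy)

# Coefficients that could not be estimated come back as NA
names(coef(fit))[is.na(coef(fit))]

# alias() reports the aliased terms in its `Complete` component
alias(fit)$Complete

# Keeping only one of the two factors removes the redundancy
fit_hotel <- lm(score ~ hotel, data = toy)
fit_stars <- lm(score ~ stars, data = toy)
anyNA(coef(fit_hotel))
anyNA(coef(fit_stars))
```

Which factor to keep depends on the question being asked: hotel name absorbs everything that is specific to a hotel, including its star rating, while stars alone gives a coarser model with fewer parameters.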

Extending this to the model you want to fit, it makes sense to check all your predictors and understand what they represent before proceeding.
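
One possible way to run such a check before fitting is to compare the number of columns in the design matrix with its numerical rank. This is only a sketch: `check_rank` is a hypothetical helper written for this illustration, and the data are made up rather than taken from the Las Vegas file.

```r
# Hypothetical helper: number of design-matrix columns vs. numerical rank
check_rank <- function(formula, data) {
  X <- model.matrix(formula, data = data)
  c(columns = ncol(X), rank = qr(X)$rank)
}

# Made-up data in which `stars` is determined by `hotel`, but `pool` is not
toy <- data.frame(
  hotel = factor(rep(c("A", "B", "C", "D"), each = 6)),
  stars = factor(rep(c("3", "3", "4", "5"), each = 6)),
  pool  = factor(rep(c("NO", "YES"), times = 12)),
  score = rep(c(3, 4, 5, 4, 2, 5), times = 4)
)

full    <- check_rank(score ~ ., data = toy)          # all predictors
reduced <- check_rank(score ~ . - stars, data = toy)  # stars dropped

full     # rank is smaller than the number of columns: something is redundant
reduced  # rank equals the number of columns
```

If the rank is smaller than the number of columns, at least one predictor is an exact linear combination of the others, and cross-tabulating pairs of factors with `table()`, as above, helps locate it.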

StupidWolf