I'm trying to predict the values of a test data set using a model fitted on a training data set. The prediction runs without errors, but the predicted values deviate a lot from the original values. Some predictions are around -356, although none of the original values exceeds 200 (and there are no negative values). The warning below is bugging me, as I suspect it is the reason the values deviate so much.

Warning message:
In predict.lm(fit2, data_test) :
  prediction from a rank-deficient fit may be misleading

Is there any way I can get rid of this warning? The code is simple:

fit2 <- lm(runs ~ ., data = train_data)
prediction <- predict(fit2, newdata = data_test)
prediction

I searched a lot, but to be honest I couldn't understand much about this warning. Here is the str() output of the train and test data sets, in case someone needs it:

> str(train_data)
'data.frame':   36 obs. of  28 variables:
 $ matchid                  : int  57 58 55 56 53 54 51 52 45 46 ...
 $ TeamName                 : chr  "South Africa" "West Indies" "South Africa" "West Indies" ...
 $ Opp_TeamName             : chr  "West Indies" "South Africa" "West Indies" "South Africa" ...
 $ TeamRank                 : int  4 3 4 3 4 3 10 7 5 1 ...
 $ Opp_TeamRank             : int  3 4 3 4 3 4 7 10 1 5 ...
 $ Team_Top10RankingBatsman : int  0 1 0 1 0 1 0 0 2 2 ...
 $ Team_Top50RankingBatsman : int  4 6 4 6 4 6 3 5 4 3 ...
 $ Team_Top100RankingBatsman: int  6 8 6 8 6 8 7 7 7 6 ...
 $ Opp_Top10RankingBatsman  : int  1 0 1 0 1 0 0 0 2 2 ...
 $ Opp_Top50RankingBatsman  : int  6 4 6 4 6 4 5 3 3 4 ...
 $ Opp_Top100RankingBatsman : int  8 6 8 6 8 6 7 7 6 7 ...
 $ InningType               : chr  "1st innings" "2nd innings" "1st innings" "2nd innings" ...
 $ Runs_OverAll             : num  361 705 348 630 347 ...
 $ AVG_Overall              : num  27.2 20 23.3 19.1 24 ...
 $ SR_Overall               : num  128 121 120 118 118 ...
 $ Runs_Last10Matches       : num  118.5 71 102.1 71 78.6 ...
 $ AVG_Last10Matches        : num  23.7 20.4 20.9 20.4 23.2 ...
 $ SR_Last10Matches         : num  120 106 114 106 116 ...
 $ Runs_BatingFirst         : num  236 459 230 394 203 ...
 $ AVG_BatingFirst          : num  30.6 23.2 24 21.2 27.1 ...
 $ SR_BatingFirst           : num  127 136 123 125 118 ...
 $ Runs_BatingSecond        : num  124 262 119 232 144 ...
 $ AVG_BatingSecond         : num  25.5 18.3 22.8 17.8 22.8 ...
 $ SR_BatingSecond          : num  125 118 112 117 114 ...
 $ Runs_AgainstTeam2        : num  88.3 118.3 76.3 103.9 49.3 ...
 $ AVG_AgainstTeam2         : num  28.2 23 24.7 22.1 16.4 ...
 $ SR_AgainstTeam2          : num  139 127 131 128 111 ...
 $ runs                     : int  165 168 231 236 195 126 143 141 191 135 ...
> str(data_test)
'data.frame':   34 obs. of  28 variables:
 $ matchid                  : int  59 60 61 62 63 64 65 66 69 70 ...
 $ TeamName                 : chr  "India" "West Indies" "England" "New Zealand" ...
 $ Opp_TeamName             : chr  "West Indies" "India" "New Zealand" "England" ...
 $ TeamRank                 : int  2 3 5 1 4 8 6 2 10 1 ...
 $ Opp_TeamRank             : int  3 2 1 5 8 4 2 6 1 10 ...
 $ Team_Top10RankingBatsman : int  1 1 2 2 0 0 1 1 0 2 ...
 $ Team_Top50RankingBatsman : int  5 6 4 3 4 2 5 5 3 3 ...
 $ Team_Top100RankingBatsman: int  7 8 7 6 6 5 7 7 7 6 ...
 $ Opp_Top10RankingBatsman  : int  1 1 2 2 0 0 1 1 2 0 ...
 $ Opp_Top50RankingBatsman  : int  6 5 3 4 2 4 5 5 3 3 ...
 $ Opp_Top100RankingBatsman : int  8 7 6 7 5 6 7 7 6 7 ...
 $ InningType               : chr  "1st innings" "2nd innings" "2nd innings" "1st innings" ...
 $ Runs_OverAll             : num  582 618 470 602 509 ...
 $ AVG_Overall              : num  25 21.8 20.3 20.7 19.6 ...
 $ SR_Overall               : num  113 120 123 120 112 ...
 $ Runs_Last10Matches       : num  182 107 117 167 140 ...
 $ AVG_Last10Matches        : num  37.1 43.8 21 24.9 27.3 ...
 $ SR_Last10Matches         : num  111 153 122 141 120 ...
 $ Runs_BatingFirst         : num  319 314 271 345 294 ...
 $ AVG_BatingFirst          : num  23.6 17.8 20.6 20.3 19.5 ...
 $ SR_BatingFirst           : num  116.9 98.5 118 124.3 115.8 ...
 $ Runs_BatingSecond        : num  264 282 304 256 186 ...
 $ AVG_BatingSecond         : num  28 23.7 31.9 21.6 16.5 ...
 $ SR_BatingSecond          : num  96.5 133.9 129.4 112 99.5 ...
 $ Runs_AgainstTeam2        : num  98.2 95.2 106.9 75.4 88.5 ...
 $ AVG_AgainstTeam2         : num  45.3 42.7 38.1 17.7 27.1 ...
 $ SR_AgainstTeam2          : num  125 138 152 110 122 ...
 $ runs                     : int  192 196 159 153 122 120 160 161 70 145 ...

In simple words, how can I get rid of this warning so that it doesn't affect my predictions?

Edit: output of fit2$coef:

(Intercept)                   matchid        TeamNameBangladesh 
            1699.98232628               -0.06793787               59.29445330 
          TeamNameEngland             TeamNameIndia       TeamNameNew Zealand 
             347.33030177             -499.40074338             -179.19192936 
         TeamNamePakistan      TeamNameSouth Africa         TeamNameSri Lanka 
            -272.71610614               -3.54867488              -45.27920191 
      TeamNameWest Indies    Opp_TeamNameBangladesh       Opp_TeamNameEngland 
            -345.54349798              135.05901017              108.04227770 
        Opp_TeamNameIndia   Opp_TeamNameNew Zealand      Opp_TeamNamePakistan 
            -162.24418387              -60.55364436             -114.74599364 
 Opp_TeamNameSouth Africa     Opp_TeamNameSri Lanka   Opp_TeamNameWest Indies 
             196.90856999              150.70170068               -6.88997714 
                 TeamRank              Opp_TeamRank  Team_Top10RankingBatsman 
                       NA                        NA                        NA 
 Team_Top50RankingBatsman Team_Top100RankingBatsman   Opp_Top10RankingBatsman 
                       NA                        NA                        NA 
  Opp_Top50RankingBatsman  Opp_Top100RankingBatsman     InningType2nd innings 
                       NA                        NA               24.24029455 
             Runs_OverAll               AVG_Overall                SR_Overall 
              -0.59935875               20.12721378              -13.60151334 
       Runs_Last10Matches         AVG_Last10Matches          SR_Last10Matches 
              -1.92526750                9.24182916                1.23914363 
         Runs_BatingFirst           AVG_BatingFirst            SR_BatingFirst 
               1.41001672               -9.88582744               -6.69780509 
        Runs_BatingSecond          AVG_BatingSecond           SR_BatingSecond 
              -0.90038727               -7.11580086                3.20915976 
        Runs_AgainstTeam2          AVG_AgainstTeam2           SR_AgainstTeam2 
               3.35936312               -5.90267210                2.36899131 
melissa
  • Check out http://stackoverflow.com/questions/26558631/predict-lm-in-a-loop-warning-prediction-from-a-rank-deficient-fit-may-be-mis and http://stats.stackexchange.com/questions/35071/what-is-rank-deficiency-and-how-to-deal-with-it for information about rank deficiency and linear regression. – Nick Becker Jul 14 '16 at 20:29
  • tbh I have already gone through these two links, but I didn't understand much :( – melissa Jul 14 '16 at 20:40
  • @ZheyuanLi thank you! I have edited the question above with 'fit2$coef' – melissa Jul 14 '16 at 21:00
  • Yeah, the warning is gone now, but the prediction hasn't improved one bit :( Also, there are no NAs in these columns, so why are they appearing here? – melissa Jul 14 '16 at 21:43

1 Answer

You can have a look at this detailed discussion: "predict.lm() in a loop. warning: prediction from a rank-deficient fit may be misleading".
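
To make the warning less abstract, here is a minimal sketch with made-up data (the data frame `toy` and its columns are hypothetical, not taken from the question). An NA coefficient does not mean the column contains NA values. It means lm() could not estimate that coefficient, because the column is an exact linear combination of columns already in the model. In the question, TeamRank and the Top10/Top50/Top100 counts appear to be fixed per team, so they add nothing once the TeamName and Opp_TeamName dummies are in. The posted fit2$coef output also lists 42 coefficients (8 of them NA) for only 36 training rows, so the fit cannot be of full rank.

```r
# Made-up data: 'rank_num' is fully determined by 'team',
# mimicking how TeamRank seems to be determined by TeamName.
set.seed(1)
toy <- data.frame(
  team = rep(c("A", "B", "C"), each = 4),
  x    = rnorm(12)
)
toy$rank_num <- match(toy$team, c("A", "B", "C"))  # a function of 'team' only
toy$y <- 10 + 2 * toy$x + rnorm(12)

fit <- lm(y ~ team + x + rank_num, data = toy)

coef(fit)          # rank_num is NA: it carries no information beyond 'team'
fit$rank           # number of coefficients that could actually be estimated
length(coef(fit))  # number of coefficients requested by the formula
alias(fit)         # reports which terms are linear combinations of others

# Dropping the redundant column gives the same fitted values, with no NA coefficient
fit_ok <- lm(y ~ team + x, data = toy)
all.equal(fitted(fit), fitted(fit_ok))
```

Dropping redundant columns removes the NA coefficients, but it does not by itself make the predictions good: with about as many estimated coefficients as rows, the model can fit the training data almost exactly and still extrapolate badly on new data.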

In general, multicollinearity among the predictors can lead to a rank-deficient design matrix in regression (linear regression via lm() in this case). You can try applying PCA to the numeric predictors to tackle the multicollinearity, and then fit the regression on a few principal components.
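
The PCA suggestion could look roughly like the sketch below. It uses simulated data, not the asker's: `make_data`, the columns `x1` to `x4`, and the choice `k <- 2` are all assumptions for illustration. The point it shows is that the rotation is learned on the training predictors only and then reused to project the test set.

```r
set.seed(42)

# Hypothetical data generator: x1..x3 are near-copies of one another (strong collinearity)
make_data <- function(n) {
  z1 <- rnorm(n)
  z2 <- rnorm(n)
  d <- data.frame(
    x1 = z1 + rnorm(n, sd = 0.05),
    x2 = z1 + rnorm(n, sd = 0.05),
    x3 = z1 + rnorm(n, sd = 0.05),
    x4 = z2
  )
  d$y <- 100 + 5 * z1 - 3 * z2 + rnorm(n)
  d
}
train <- make_data(60)
test  <- make_data(10)

predictors <- c("x1", "x2", "x3", "x4")

# PCA on the training predictors only; centre and scale so no unit dominates
pca <- prcomp(train[, predictors], center = TRUE, scale. = TRUE)

k <- 2  # components kept (an assumption; choose via summary(pca) or cross-validation)
train_pc <- data.frame(y = train$y, pca$x[, 1:k, drop = FALSE])
fit_pc <- lm(y ~ ., data = train_pc)

# Project the test set with the *training* rotation, then predict
test_scores <- predict(pca, newdata = test[, predictors])
test_pc <- as.data.frame(test_scores[, 1:k, drop = FALSE])
pred <- predict(fit_pc, newdata = test_pc)
pred
```

prcomp() only accepts numeric columns, so character columns such as TeamName would have to be left out or encoded separately. Keeping k small relative to the number of training rows is what reduces the overfitting here; PCA alone does not compensate for having only 36 observations.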