
Using step(y ~ ., data = data) in R, I identified the best multiple linear regression model for estimating a response (y) from 50 observations.

lm(formula = y ~ reservoir_storage + sewage_plants + mnq_kla_rel + 
porous + complex + lu_urban + lu_forest + wat_prot_area_rel + 
Q95Q50 + model, data = data)

All independent variables are indices (numeric or binary), except the predictor model, which is categorical: it holds the name of the best model I used before (e.g. LAY, PA2, LL1, LBY1, MAT). What the abbreviations mean is not important here. Here is a sample of the data:

    area model Q95Q50 hydropower ... ... ...
   <dbl> <chr>  <dbl>      <dbl>
 1 169.  LL1    0.454          0
 2  88.8 LBY1   0.707          0
 3 130.  LBY1   0.605          0
 4  80.6 LAY    0.322          0
 5  53.9 LAY    0.595          1
 6 110.  LL1    0.415          1
 7 107.  LAY    0.544          0
 8  47.2 LAY    0.412          0
 9  49.0 LAY    0.355          0
10  43.2 PA2    0.216          1

With vi() from the vip package I calculated the variable importance (https://koalaverse.github.io/vip/reference/vi.html):

   Variable          Importance Sign 
   <chr>                  <dbl> <chr>
 1 Q95Q50                 7.06  POS  
 2 modelPA2               5.55  NEG  
 3 modelMAT               5.35  NEG  
 4 lu_urban               4.20  POS  
 5 mnq_kla_rel            4.03  NEG  
 6 modelLBY1              3.53  NEG  
 7 porous                 2.32  POS  
 8 lu_forest              2.05  POS  
 9 wat_prot_area_rel      1.82  NEG  
10 complex                1.75  POS  
11 reservoir_storage      1.73  POS  
12 sewage_plants          1.27  NEG  
13 modelLL1               0.936 NEG  


I wonder how to interpret the Importance values (I understand the Sign column), but my bigger problem is combining the importance of model into a single value. I get separate entries for modelMAT, modelPA2, and so on, but I want the importance of model as a whole, as in the ANOVA table:

> fit %>%  anova
Analysis of Variance Table

Response: y
                  Df   Sum Sq  Mean Sq F value    Pr(>F)    
reservoir_storage  1 0.000068 0.000068  0.0773  0.782553    
sewage_plants      1 0.000945 0.000945  1.0697  0.307917    
mnq_kla_rel        1 0.014368 0.014368 16.2627  0.000274 ***
porous             1 0.005891 0.005891  6.6674  0.014034 *  
complex            1 0.006897 0.006897  7.8064  0.008291 ** 
lu_urban           1 0.009580 0.009580 10.8430  0.002229 ** 
lu_forest          1 0.000087 0.000087  0.0981  0.755980    
wat_prot_area_rel  1 0.001442 0.001442  1.6318  0.209633    
Q95Q50             1 0.059144 0.059144 66.9435 9.884e-10 ***
model              4 0.046172 0.011543 13.0654 1.138e-06 ***
Residuals         36 0.031805 0.000883                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
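
Note that anova() on an lm fit reports sequential sums of squares, so each row depends on the order of the terms in the formula. One possible way to get a single, order-independent figure per term (including model as one block with 4 degrees of freedom) is drop1() with an F-test. The following is only a minimal sketch on simulated stand-in data; the data frame dat, the shift values, and the reduced set of predictors are assumptions made up for illustration, not the data from the question.

```r
# Simulated stand-in data (hypothetical; NOT the data from the question)
set.seed(1)
n <- 50
dat <- data.frame(
  Q95Q50   = runif(n, 0.2, 0.8),
  lu_urban = runif(n),
  model    = factor(rep(c("LAY", "LBY1", "LL1", "MAT", "PA2"), each = 10))
)
# Made-up level effects so that `model` has some influence on y
shift <- c(LAY = 0, LBY1 = -0.05, LL1 = -0.01, MAT = -0.08, PA2 = -0.08)
dat$y <- 0.3 * dat$Q95Q50 + 0.1 * dat$lu_urban +
  unname(shift[as.character(dat$model)]) + rnorm(n, sd = 0.03)

fit <- lm(y ~ Q95Q50 + lu_urban + model, data = dat)

# Marginal F-test per term: each term is dropped from the full model in turn,
# so the factor `model` is assessed as a whole (4 degrees of freedom)
# instead of one row per dummy coefficient.
term_tests <- drop1(fit, test = "F")
print(term_tests)
```

The "F value" column could then serve as a rough per-term ranking; whether that is an appropriate importance measure for the real data is a judgment call.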

Perhaps someone can help me with these questions:

  • Why is LAY not listed under model, while all the other model names are?
  • Is there a way to summarise the importance of all the model levels as a single value?
  • What is the best way to quantify the importance of the different predictors? Can I use the p-values for that in a relative manner?
  • Does anyone have experience interpreting the Importance values and can give me a hint, e.g. what is their unit or meaning?

Best+Thanks, Michael

mod_che
  • For the missing level of `model`, this explains it better than I can: https://stackoverflow.com/questions/41032858/lm-summary-not-display-all-factor-levels. If you only have one categorical variable, try adding `-1` to the lm formula, as the `LAY` effect is being absorbed into the intercept – Jonny Phelps Jun 10 '21 at 12:45
  • For linear models the importance is the absolute value of the t-statistic https://topepo.github.io/caret/variable-importance.html – mod_che Jun 15 '21 at 21:38
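
The two comments above can be illustrated with a small sketch. It uses simulated stand-in data (the variables x, y and model below are made up, not taken from the question) and assumes R's default treatment contrasts. It shows that the first factor level gets no coefficient of its own, and how absolute t-statistics per coefficient can be read from summary(). If the importance is indeed the absolute t-statistic, as the second comment suggests, it is unitless, and each model* entry compares one level against the baseline LAY rather than measuring the factor as a whole.

```r
# Simulated stand-in data (hypothetical; NOT the data from the question)
set.seed(1)
model <- factor(rep(c("LAY", "LBY1", "LL1", "MAT", "PA2"), each = 10))
x <- runif(50)
y <- 0.3 * x + rnorm(50, sd = 0.05)
fit <- lm(y ~ x + model)

# With default treatment contrasts the first level (alphabetically LAY)
# is the baseline: it has no coefficient and is part of the intercept.
levels(model)[1]
names(coef(fit))

# Absolute t-statistic of every coefficient, intercept excluded
t_abs <- abs(coef(summary(fit))[-1, "t value"])
sort(t_abs, decreasing = TRUE)
```

Choosing a different baseline, for example with relevel(model, ref = "MAT"), changes which level is omitted and also changes the t-statistics of the remaining dummy coefficients, which is one reason they are awkward as a measure of the factor's overall importance.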
