With step(y ~ ., data = data) in R I've identified the best multiple linear model to estimate a response (y). I have 50 observations.
lm(formula = y ~ reservoir_storage + sewage_plants + mnq_kla_rel +
porous + complex + lu_urban + lu_forest + wat_prot_area_rel +
Q95Q50 + model, data = data)
All independent variables are indices (numeric or binary), but the predictor model is the name of a best model I used before (i.e. LAY, PA2, LL1, LBY1, MAT); what the abbreviations mean is not important here. Here is an example of some data:
area model Q95Q50 hydropower ... ... ...
<dbl> <chr> <dbl> <dbl>
1 169. LL1 0.454 0
2 88.8 LBY1 0.707 0
3 130. LBY1 0.605 0
4 80.6 LAY 0.322 0
5 53.9 LAY 0.595 1
6 110. LL1 0.415 1
7 107. LAY 0.544 0
8 47.2 LAY 0.412 0
9 49.0 LAY 0.355 0
10 43.2 PA2 0.216 1
With vi() from the vip package I calculated the importance (https://koalaverse.github.io/vip/reference/vi.html):
Variable Importance Sign
<chr> <dbl> <chr>
1 Q95Q50 7.06 POS
2 modelPA2 5.55 NEG
3 modelMAT 5.35 NEG
4 lu_urban 4.20 POS
5 mnq_kla_rel 4.03 NEG
6 modelLBY1 3.53 NEG
7 porous 2.32 POS
8 lu_forest 2.05 POS
9 wat_prot_area_rel 1.82 NEG
10 complex 1.75 POS
11 reservoir_storage 1.73 POS
12 sewage_plants 1.27 NEG
13 modelLL1 0.936 NEG
Although I also wonder how to interpret the Importance values (I understand the Sign column), my bigger problem is combining the importance of the individual model levels. I get modelMAT, modelPA2, etc., but I want the importance of model as a whole, like in the ANOVA table:
> fit %>% anova
Analysis of Variance Table
Response: y
Df Sum Sq Mean Sq F value Pr(>F)
reservoir_storage 1 0.000068 0.000068 0.0773 0.782553
sewage_plants 1 0.000945 0.000945 1.0697 0.307917
mnq_kla_rel 1 0.014368 0.014368 16.2627 0.000274 ***
porous 1 0.005891 0.005891 6.6674 0.014034 *
complex 1 0.006897 0.006897 7.8064 0.008291 **
lu_urban 1 0.009580 0.009580 10.8430 0.002229 **
lu_forest 1 0.000087 0.000087 0.0981 0.755980
wat_prot_area_rel 1 0.001442 0.001442 1.6318 0.209633
Q95Q50 1 0.059144 0.059144 66.9435 9.884e-10 ***
model 4 0.046172 0.011543 13.0654 1.138e-06 ***
Residuals 36 0.031805 0.000883
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
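To make the "model as a whole" idea concrete, here is a minimal sketch on made-up data (the data frame d, its columns, and fit_demo are hypothetical stand-ins, not my real data). It contrasts the sequential table from anova() with the marginal table from drop1(), which removes each term from the full model in turn and therefore tests all dummy columns of a factor jointly. This is only one possible way to get a single figure per term, not necessarily the right measure of importance.

```r
# Made-up data: 50 rows, one numeric predictor and a 5-level factor.
set.seed(1)
n <- 50
d <- data.frame(
  Q95Q50 = runif(n, 0.2, 0.8),
  model  = factor(rep(c("LAY", "LBY1", "LL1", "MAT", "PA2"), length.out = n))
)
d$y <- 0.3 * d$Q95Q50 + rnorm(n, sd = 0.03)

fit_demo <- lm(y ~ Q95Q50 + model, data = d)

# Sequential (type I) table: each term is tested after the terms listed before it.
seq_tab <- anova(fit_demo)

# Marginal table: each term is dropped from the full model in turn,
# so the factor `model` gets one F test covering all of its dummy columns.
marg_tab <- drop1(fit_demo, test = "F")

print(seq_tab)
print(marg_tab)
```

Because model is the last term in my formula, its row in the sequential table already coincides with the marginal test; for the other predictors the two tables can differ.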
Perhaps someone can help me with these questions:
- Why is LAY not listed under model, while all the other model names are?
- Is there a way to summarise the importance of all different models?
- What is the best way to quantify the importance of different predictors? Can I use the p-value for that in a relative manner?
- Does anyone have experience interpreting the Importance values and can give me a hint, e.g. what is the unit or meaning of the Importance values?
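As background for the first and last bullet, a minimal sketch on made-up data (d, x, model2 and fit_demo are hypothetical names, not my real data). It shows how R's default treatment coding absorbs the first factor level into the intercept, and how absolute t-statistics can be read off an lm fit. The vip documentation describes the model-based importance for linear models as the absolute t-statistic of each coefficient; that vi() returned exactly this here is an assumption I have not checked against the package source.

```r
# Made-up data: 50 rows, one numeric predictor and a 5-level factor.
set.seed(2)
n <- 50
d <- data.frame(
  x     = rnorm(n),
  model = factor(rep(c("LAY", "LBY1", "LL1", "MAT", "PA2"), length.out = n))
)
d$y <- 0.5 * d$x + rnorm(n)

fit_demo <- lm(y ~ x + model, data = d)

# Factor levels are sorted alphabetically by default; the first one (LAY)
# is the reference level and is absorbed into the intercept.
print(levels(d$model))

# Coefficient table: one row per dummy column, none for the reference level.
coefs <- summary(fit_demo)$coefficients
abs_t <- abs(coefs[-1, "t value"])   # drop the intercept row
print(sort(abs_t, decreasing = TRUE))

# Choosing another baseline changes which level is missing from the table.
d$model2 <- relevel(d$model, ref = "PA2")
fit_relevel <- lm(y ~ x + model2, data = d)
print(names(coef(fit_relevel)))
```

Under this coding, each model row compares one level against the reference level only, so the per-level values depend on which level is the baseline.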
Best+Thanks, Michael