
I would like to compare models (multiple regression, LASSO, Ridge, GBM) in terms of variable importance, but I'm not sure the procedure is correct because the values obtained are not on the same scale.

For multiple regression and GBM, the values range from 0 to 100 using varImp from the caret package (a usage sketch follows the two descriptions below). The statistic is calculated differently for each method:

Linear Models: the absolute value of the t-statistic for each model parameter is used.

Boosted Trees: this method uses the same approach as a single tree, but sums the importance of each boosting iteration.
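For reference, here is a minimal sketch of how those 0-100 importances might be produced with caret. The data frame, variable names, and formula are hypothetical, purely for illustration; it assumes the caret and gbm packages are installed.

library(caret)

# Hypothetical data for illustration only
set.seed(1)
df <- data.frame(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))

# Multiple regression: importance is based on the absolute t-statistic of each coefficient
lm_fit <- train(y ~ ., data = df, method = "lm")
varImp(lm_fit)    # scaled to 0-100 by default

# GBM: importance is summed over the boosting iterations
gbm_fit <- train(y ~ ., data = df, method = "gbm", verbose = FALSE)
varImp(gbm_fit)   # also scaled to 0-100 by default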

While for LASSO and Ridge the values range from 0.00 to 0.99, calculated with the following function:

varImp <- function(object, lambda = NULL, ...) {
  # extract the coefficients at the chosen lambda
  beta <- predict(object, s = lambda, type = "coef")
  if (is.list(beta)) {
    out <- do.call("cbind", lapply(beta, function(x) x[, 1]))
    out <- as.data.frame(out)
  } else {
    out <- data.frame(Overall = beta[, 1])
  }
  # drop the intercept and report absolute coefficient values
  out <- abs(out[rownames(out) != "(Intercept)", , drop = FALSE])
  out
}

That function was obtained here: Caret package - glmnet variable importance
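As a point of reference, a minimal usage sketch of that function on a glmnet fit might look like the following; the data and the choice of lambda are hypothetical and only for illustration.

library(glmnet)

# Hypothetical data for illustration only
set.seed(1)
x <- matrix(rnorm(100 * 5), ncol = 5)
y <- rnorm(100)

fit <- glmnet(x, y, alpha = 1)           # alpha = 1 is LASSO; alpha = 0 is Ridge
varImp(fit, lambda = min(fit$lambda))    # absolute coefficients, not rescaled to 0-100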

I looked through other questions on the forum but could not understand why the scales differ. How can I make these measurements comparable?

j_3265
  • This is probably a more general question about comparing statistical models that would be better asked at [stats.se]. Otherwise, at least include a minimal [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and your specific modeling functions so possible solutions can be tested and verified. But in general it's not usually advisable to directly compare results like that across completely different model types. – MrFlick Jan 07 '20 at 21:33
  • Thanks MrFlick, I will change the question! – j_3265 Jan 07 '20 at 22:34

1 Answer


If the goal is simply to compare them side by side, then what matters is creating a scale that they can all inhabit together, and sorting them.

You can accomplish this by creating a standardized scale and coercing all of your variable importances onto that one consistent scale, in this case 0 to 100.


importance_data <- c(-23, 12, 32, 18, 45, 1, 77, 18, 22)

new_scale <- function(x) {
  # linearly rescale so that min(x) maps to 0 and max(x) maps to 100
  y <- ((100 - 0) / (max(x) - min(x))) * (x - max(x)) + 100
  sort(y)
}

new_scale(importance_data)


#results
[1]   0  24  35  41  41  45  55  68 100

This will give you a uniform scale. It does not mean that a 22 on one scale is exactly the same as a 22 on another scale, but for relative comparison, any common scale will do.

This gives you a standardized sense of the separation between the importances of the variables within each model, so you can evaluate the models side by side more easily based on the relative scaled importances.
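For instance, a sketch of how the rescaling might be applied across models; the importance vectors and variable names below are made up purely for illustration, and they assume new_scale as defined above.

# Hypothetical raw importances from two different models (values are illustrative)
lm_imp    <- c(x1 = 12.3, x2 = 4.1, x3 = 0.7, x4 = 25.0)
lasso_imp <- c(x1 = 0.85, x2 = 0.10, x3 = 0.00, x4 = 0.62)

new_scale(lm_imp)     # each vector ends up on the same 0-100 scale,
new_scale(lasso_imp)  # so the relative spacing of the variables can be compared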

sconfluentus
  • Got it, thanks for your reply. I think this is the only means for a possible comparison. Thanks – j_3265 Jan 11 '20 at 15:27
  • The model types are different enough that I think you are right. Just be careful about 'grading' them using the absolute numbers. – sconfluentus Jan 11 '20 at 18:23