
I've written a function that takes the coefficients of a (linear) model and applies them to the original variables, producing a data frame of terms which, when summed row-wise, equal the result of predict(). This seems useful for understanding the influence each variable (or more complex interaction term, etc.) has on the model.

Is there a better way? This feels like a hack. I've looked into the str() of model objects and haven't found a simpler solution yet. The tricky part is catching and applying the interaction terms.

library(plyr)
nospredict = function(model, data = model$model, sorted = TRUE) {
    # model: anything that supports coef() (lm, glm, ...); data: data.frame to apply it to
    c = coef(model)
    my.names = names(c) = gsub(':', '*', names(c))  # ':' means multiplication in formula terms
    data = data[ , colnames(data) %in% my.names]  # don't attach() a zillion variables below...
    final.out = adply(data, 1, function(y) {  # did I mention I like plyr?
        attach(as.list(y), warn.conflicts = FALSE)  # so you can do eval() algebra blackRmagic
        on.exit(detach())  # clean the search path back up after each row
        out = ldply(my.names, function(x) {  # did I mention...
            Intercept = 1  # "(Intercept)" parses as the variable Intercept, so set it to 1
            eval(parse(text = paste(c[x], "*", x)))  # blackRmagic
        })
        out = as.data.frame(t(out)); colnames(out) = my.names; out
    })
    rownames(final.out) = rownames(data)
    final.out$Predict = predict(model, data)  # add predict() as a column
    if (sorted) final.out = final.out[order(final.out$Predict), ]  # return df sorted by predict()
    final.out
}
set.seed(10538)
df = data.frame(a = 1:10, b = rnorm(10), c = 1:10 + rnorm(10) )
lmf = lm( c ~ a * b, data = df)

> df
    a           b         c
1   1 -0.41184664 1.3739709
2   2  1.06464586 0.8975101
3   3 -0.07522363 3.4910425
4   4  1.21643049 2.8856876
5   5  0.34061917 4.3851439
6   6 -1.00020786 6.1836535
7   7 -0.36954963 6.4734150
8   8 -1.47754640 6.8150569
9   9 -0.19312147 9.6432687
10 10  2.32220098 9.4276813


> nospredict(lmf)
   (Intercept)         a           b         a*b   Predict
2   0.09801818 1.8564370 -1.24942570  0.28118420 0.9862137
1   0.09801818 0.9282185  0.48332671 -0.05438652 1.4551769
3   0.09801818 2.7846555  0.08827944 -0.02980103 2.9411521
4   0.09801818 3.7128740 -1.42755405  0.64254425 3.0258824
5   0.09801818 4.6410925 -0.39973700  0.22490279 4.5642765
6   0.09801818 5.5693110  1.17380385 -0.79249635 6.0486367
7   0.09801818 6.4975295  0.43368863 -0.34160685 6.6876294
8   0.09801818 7.4257480  1.73398922 -1.56094237 7.6968130
9   0.09801818 8.3539665  0.22663962 -0.22952439 8.4490999
10  0.09801818 9.2821850 -2.72524198  3.06658890 9.7215501
  • The final sort by predict() values wasn't working in my original version; trivial fix, not important to the question. – Nathan Siemers Mar 21 '16 at 05:28
  • Of course, the reproducible example above is only that. I'm looking for a solution that generalizes across most or all model formulas (without knowing the model had 'a' and 'b' in it, etc.). – Nathan Siemers Mar 22 '16 at 02:42

1 Answer

junk <- data.frame( x1 = rnorm(100), x2 = rnorm(100))

junk$YY <- 2 * junk$x1 + 4 * junk$x2 + 6 * junk$x1 * junk$x2 + 7 + rnorm(100)

out <- lm(YY ~ x1 * x2, data = junk, x = TRUE) # The x = TRUE part is key!

head(out$x)
   (Intercept)          x1         x2        x1:x2
 1           1 -0.34885894 -0.8127228  0.283525629
 2           1 -0.04482498 -0.1601600  0.007179167
 3           1 -1.11721391  0.3266213 -0.364905892
 4           1 -0.08530188  0.3482372 -0.029705287
 5           1  0.19138684 -0.1659683 -0.031764149
 6           1 -1.89493717  1.0261454 -1.944481020
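
As an aside: if you'd rather not refit with x = TRUE, I believe model.matrix() on the fitted object recovers the same design matrix (a quick sketch, reusing junk and out from above):

out2 <- lm(YY ~ x1 * x2, data = junk)  # same fit, without x = TRUE
all.equal(model.matrix(out2), out$x, check.attributes = FALSE)  # TRUE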

coef(out)
(Intercept)          x1          x2       x1:x2 
   7.053434    1.804441    4.130249    5.970553

nomThings <- t( t(out$x[, names(coef(out))]) * coef(out) )  # multiply each design-matrix column by its coefficient
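
As a quick sanity check (using the objects from above), each per-row sum of nomThings should match the corresponding fitted value:

all.equal(rowSums(nomThings), predict(out))  # TRUE, up to floating point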

I'm not entirely sure this will work correctly if you have factors as independent variables, or that it will work correctly in ALL situations, but I suspect it will.
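
For what it's worth, a quick sketch with a made-up factor column (g below is hypothetical, tacked onto the junk data from above) suggests factors are handled, since the dummy columns of the model matrix line up with their coefficients by name:

junk$g <- factor(sample(c("lo", "hi"), 100, replace = TRUE))  # hypothetical factor
outf <- lm(YY ~ x1 * x2 + g, data = junk, x = TRUE)
termsF <- t( t(outf$x[, names(coef(outf))]) * coef(outf) )  # same trick as above
all.equal(rowSums(termsF), predict(outf))  # TRUE: dummy columns multiply cleanly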

You could, of course, save coef(out) as a temp object, and so on, to make the code a little more readable and slightly more efficient.

Given that you can easily do this, I would think hard about whether you should.

  • Thank you. x=TRUE in the lm() call is the magic option that provides a complete model matrix and makes life much easier and probably more reliable. Independent of the value of doing this, of course. – Nathan Siemers Mar 21 '16 at 18:43