Fit regression model to multiple independent and dependent variables and obtain separate fits by grouping variable

Question

Lets try this again....

Using the mtcars dataset I want to fit a nonlinear regression model to multiple dependent and independent variables using the same model. Lets say I want to use variables disp, hp, and wt to explain mpg and drat. After fitting the model I want to calculate total sums of squares and residual sums of squares and store them in a matrix. This can be done the long way by...

dt <- data.frame(mtcars)
m1 <- nls(mpg ~ B0*(disp^B1)*exp(B2*disp), data=dt, start=c(B0 = 45, B1 = 0.2, B2 = 0.0007))
m2 <- nls(mpg ~ B0*(hp^B1)*exp(B2*hp), data=dt, start=c(B0 = 45, B1 = 0.2, B2 = 0.0007))
m3 <- nls(mpg ~ B0*(wt^B1)*exp(B2*wt), data=dt, start=c(B0 = 45, B1 = 0.2, B2 = 0.0007))
m4 <- nls(drat ~ B0*(disp^B1)*exp(B2*disp), data=dt, start=c(B0 = 45, B1 = 0.2, B2 = 0.0007))
m5 <- nls(drat ~ B0*(hp^B1)*exp(B2*hp), data=dt, start=c(B0 = 45, B1 = 0.2, B2 = 0.0007))
m6 <- nls(drat ~ B0*(wt^B1)*exp(B2*wt), data=dt, start=c(B0 = 45, B1 = 0.2, B2 = 0.0007))
TSS.mpg <- sum((dt$mpg - mean(dt$mpg))^2)
TSS.drat <- sum((dt$drat - mean(dt$drat))^2)
RSS.m1 <- sum(residuals(m1)^2)
RSS.m2 <- sum(residuals(m2)^2)
RSS.m3 <- sum(residuals(m3)^2)
RSS.m4 <- sum(residuals(m4)^2)
RSS.m5 <- sum(residuals(m5)^2)
RSS.m6 <- sum(residuals(m6)^2)

sumsqu <- matrix(0,6,2)
sumsqu[1:3,1] <- TSS.mpg
sumsqu[4:6,1] <- TSS.drat
sumsqu[1,2] <- RSS.m1
sumsqu[2,2] <- RSS.m2
sumsqu[3,2] <- RSS.m3
sumsqu[4,2] <- RSS.m4
sumsqu[5,2] <- RSS.m5
sumsqu[6,2] <- RSS.m6

So, the final result is a matrix with the 1st column as the total sums of squares and the 2nd as residual sums of squares. Now, lets make this more complicated by including a grouping factor. I want to do the same model fitting and SS extraction but for two groups based on the variable "am" where am=0 or 1. The final result will be a matrix similar to that in part 1 but with four columns, the first 2 columns for am=0 and the second 2 columns for am=1. Again this can be done the long way by...

#subset the data (am = 0) and refit models
dt0 <- subset(dt, am == 0)
m1.0 <- nls(mpg ~ B0*(disp^B1)*exp(B2*disp), data=dt0, start=c(B0 = 45, B1 = 0.2, B2 = 0.0007))
m2.0 <- nls(mpg ~ B0*(hp^B1)*exp(B2*hp), data=dt0, start=c(B0 = 45, B1 = 0.2, B2 = 0.0007))
m3.0 <- nls(mpg ~ B0*(wt^B1)*exp(B2*wt), data=dt0, start=c(B0 = 45, B1 = 0.2, B2 = 0.0007))
m4.0 <- nls(drat ~ B0*(disp^B1)*exp(B2*disp), data=dt0, start=c(B0 = 45, B1 = 0.2, B2 = 0.0007))
m5.0 <- nls(drat ~ B0*(hp^B1)*exp(B2*hp), data=dt0, start=c(B0 = 45, B1 = 0.2, B2 = 0.0007))
m6.0 <- nls(drat ~ B0*(wt^B1)*exp(B2*wt), data=dt0, start=c(B0 = 45, B1 = 0.2, B2 = 0.0007))
TSS.mpg.0 <- sum((dt0$mpg - mean(dt0$mpg))^2)
TSS.drat.0 <- sum((dt0$drat - mean(dt0$drat))^2)
RSS.m1.0 <- sum(residuals(m1.0)^2)
RSS.m2.0 <- sum(residuals(m2.0)^2)
RSS.m3.0 <- sum(residuals(m3.0)^2)
RSS.m4.0 <- sum(residuals(m4.0)^2)
RSS.m5.0 <- sum(residuals(m5.0)^2)
RSS.m6.0 <- sum(residuals(m6.0)^2)

sumsqu.0 <- matrix(0,6,2)
sumsqu.0[1:3,1] <- TSS.mpg.0
sumsqu.0[4:6,1] <- TSS.drat.0
sumsqu.0[1,2] <- RSS.m1.0
sumsqu.0[2,2] <- RSS.m2.0
sumsqu.0[3,2] <- RSS.m3.0
sumsqu.0[4,2] <- RSS.m4.0
sumsqu.0[5,2] <- RSS.m5.0
sumsqu.0[6,2] <- RSS.m6.0

#subset the data (am=1) and refit models
dt1 <- subset(dt, am == 1)
m1.1 <- nls(mpg ~ B0*(disp^B1)*exp(B2*disp), data=dt1, start=c(B0 = 45, B1 = 0.2, B2 = 0.0007))
m2.1 <- nls(mpg ~ B0*(hp^B1)*exp(B2*hp), data=dt1, start=c(B0 = 45, B1 = 0.2, B2 = 0.0007))
m3.1 <- nls(mpg ~ B0*(wt^B1)*exp(B2*wt), data=dt1, start=c(B0 = 45, B1 = 0.2, B2 = 0.0007))
m4.1 <- nls(drat ~ B0*(disp^B1)*exp(B2*disp), data=dt1, start=c(B0 = 45, B1 = 0.2, B2 = 0.0007))
m5.1 <- nls(drat ~ B0*(hp^B1)*exp(B2*hp), data=dt1, start=c(B0 = 45, B1 = 0.2, B2 = 0.0007))
m6.1 <- nls(drat ~ B0*(wt^B1)*exp(B2*wt), data=dt1, start=c(B0 = 45, B1 = 0.2, B2 = 0.0007))
TSS.mpg.1 <- sum((dt1$mpg - mean(dt1$mpg))^2)
TSS.drat.1 <- sum((dt1$drat - mean(dt1$drat))^2)
RSS.m1.1 <- sum(residuals(m1.1)^2)
RSS.m2.1 <- sum(residuals(m2.1)^2)
RSS.m3.1 <- sum(residuals(m3.1)^2)
RSS.m4.1 <- sum(residuals(m4.1)^2)
RSS.m5.1 <- sum(residuals(m5.1)^2)
RSS.m6.1 <- sum(residuals(m6.1)^2)

sumsqu.1 <- matrix(0,6,2)
sumsqu.1[1:3,1] <- TSS.mpg.1
sumsqu.1[4:6,1] <- TSS.drat.1
sumsqu.1[1,2] <- RSS.m1.1
sumsqu.1[2,2] <- RSS.m2.1
sumsqu.1[3,2] <- RSS.m3.1
sumsqu.1[4,2] <- RSS.m4.1
sumsqu.1[5,2] <- RSS.m5.1
sumsqu.1[6,2] <- RSS.m6.1

#combine sumsqu.1 and sumsqu.0
allSS <- cbind(sumsqu.0,sumsqu.1)
allSS

As you can see the process gets rather lengthy the way I know how to do. Now picture that my real problem has 6 dependent variables, 7 independent variables, 5 groups, and extracting 10 or so numbers from each fit. From my code you can see that I am not a programmer as my method is highly inefficient.I thought that I could include some kind of function and then use some apply function, for example..

nls1 <- function(x,y){
m1 <- nls( y ~ B0*(x^B1)*exp(B2*x), data=dt0, start=c(B0 = 3.5, B1 = 0.2, B2 = 0.0007))
RSS <- sum(residuals(m1)^2)
TSS <- sum((y - mean(y))^2)
RSS
TSS
}

Any help making this process more efficient is much appreciated.

"I have no reproducible code but my problem is actually much larger than this" but you did attempt to describe a minimal version of your problem, which is good! Now just make the minimal version reproducible! It's extremely hard to imagine any good answer to this question that doesn't show code on a reproducible example. [See here for reproducibility tips](http://stackoverflow.com/q/5963269/903061). — Gregor Thomas, Sep 02 '15 at 22:26
Do you anticipate that some of the explanatory power of ind vars `b1 - b4` is relevant across dep vars `a1 - a3`? If so, then this approach is not statistically valid. Instead, you should have a single model which has the 12 models you describe in it.

Regardless, of course, it would be much nicer to see a reproducible example with dummy data. — alexwhitworth, Sep 02 '15 at 23:33

AntoniosK · Accepted Answer · 2015-09-08T11:16:35.880

Here I'm using 2 dependent variables (drat,mpg), 3 independent variables (disp,hp,wt) and 1 grouping variable with 2 levels/classes (am as 1/0).

library(dplyr)
library(tidyr)

# example dataset (picking useful columns)
dt <- data.frame(mtcars) %>% select(drat, mpg, disp, hp, wt, am)

# specify which columns we want as y (dependent) and x (independent)
# grouping variable is specified within the dependent variables
ynames <- c("drat","mpg","am")
xnames <- c("disp","hp","wt")

# create and reshape datasets
dt1 <- dt[,ynames]
dt1 <- gather(dt1, y, yvalue, -am)

dt2 <- dt[,xnames]
dt2 <- gather(dt2, x, xvalue)


dt1 %>% 
  group_by(y) %>%                 
  do(data.frame(.,dt2)) %>%
  group_by(y,x,am) %>% 
  do({ m1 <- nls( yvalue ~ B0*(xvalue^B1)*exp(B2*xvalue), data=., start=c(B0 = 45, B1 = 0.2, B2 = 0.0007))
  RSS <- sum(residuals(m1)^2)
  TSS <- sum((.$yvalue - mean(.$yvalue))^2)
  data.frame(RSS,TSS) })

#       y    x am         RSS        TSS
# 1  drat disp  0   1.3090406   2.770242
# 2  drat disp  1   1.1155372   1.590400
# 3  drat   hp  0   2.1707337   2.770242
# 4  drat   hp  1   0.8342527   1.590400
# 5  drat   wt  0   2.2100162   2.770242
# 6  drat   wt  1   1.1885811   1.590400
# 7   mpg disp  0  98.4815286 264.587368
# 8   mpg disp  1  46.8674036 456.309231
# 9   mpg   hp  0  74.9295161 264.587368
# 10  mpg   hp  1 112.5548955 456.309231
# 11  mpg   wt  0 104.2894519 264.587368
# 12  mpg   wt  1  71.1402536 456.309231

As you can see the method above reshapes data and creates a bigger dataset with all y and x variable combinations needed. You might have problems if you end up having a huge dataset. Or maybe someone else who's having a similar problem needs to deal with variables with big lengths and creating that big data set creates issues.

It's better to create the formula we need for each model fit instead of creating combinations of variables. This approach is similar to what @BondedDust suggested below.

library(dplyr)

# example dataset (picking useful columns)
dt <- data.frame(mtcars) %>% select(drat, mpg, disp, hp, wt, am)

# specify which columns we want as y (dependent) and x (independent)
ynames <- c("drat","mpg")
xnames <- c("disp","hp","wt")

# get unique values of the grouping variable am
groupvalues = unique(dt$am)


expand.grid(ynames,xnames,groupvalues) %>%
  data.frame() %>%
  select(y=Var1, x=Var2, group=Var3) %>%
  mutate(formula = paste0(y," ~ B0*(",x,"^B1)*exp(B2*",x,")")) %>%
  group_by(y,x,group,formula) %>%
  do({ m1 <- nls( .$formula, data=dt[dt$am==.$group,], 
                  start=c(B0 = 45, B1 = 0.2, B2 = 0.0007))
  RSS <- sum(residuals(m1)^2)
  TSS <- sum((dt[dt$am==.$group,][,.$y]- mean(dt[dt$am==.$group,][,.$y]))^2)
  data.frame(RSS,TSS) })


#       y    x group                          formula         RSS        TSS
# 1  drat disp     0 drat ~ B0*(disp^B1)*exp(B2*disp)   1.3090406   2.770242
# 2  drat disp     1 drat ~ B0*(disp^B1)*exp(B2*disp)   1.1155372   1.590400
# 3  drat   hp     0     drat ~ B0*(hp^B1)*exp(B2*hp)   2.1707337   2.770242
# 4  drat   hp     1     drat ~ B0*(hp^B1)*exp(B2*hp)   0.8342527   1.590400
# 5  drat   wt     0     drat ~ B0*(wt^B1)*exp(B2*wt)   2.2100162   2.770242
# 6  drat   wt     1     drat ~ B0*(wt^B1)*exp(B2*wt)   1.1885811   1.590400
# 7   mpg disp     0  mpg ~ B0*(disp^B1)*exp(B2*disp)  98.4815286 264.587368
# 8   mpg disp     1  mpg ~ B0*(disp^B1)*exp(B2*disp)  46.8674036 456.309231
# 9   mpg   hp     0      mpg ~ B0*(hp^B1)*exp(B2*hp)  74.9295161 264.587368
# 10  mpg   hp     1      mpg ~ B0*(hp^B1)*exp(B2*hp) 112.5548955 456.309231
# 11  mpg   wt     0      mpg ~ B0*(wt^B1)*exp(B2*wt) 104.2894519 264.587368
# 12  mpg   wt     1      mpg ~ B0*(wt^B1)*exp(B2*wt)  71.1402536 456.309231

This looks like something I need to do. I've added an example that may help better understand my problem.Thanks — Micky, Sep 03 '15 at 15:16
I've updated my answer, but based on my process. I feel like you have to create all combinations of dependent, independent and grouping variables if you want to do it that way. I'll have a look. — AntoniosK, Sep 03 '15 at 15:50
There are some issues with what you posted. You use dt1 before you create it. I'll update my answer to include dependent, independent and grouping variable and we can compare our approaches to see if we agree. — AntoniosK, Sep 03 '15 at 16:41
You are correct. I fixed that mistake in the code so that it works. — Micky, Sep 04 '15 at 05:06
This looks exactly like what I need, however, the values for RSS and TSS are not the same with your code and mine. — Micky, Sep 04 '15 at 05:14
I see that disp is used as a dependent variable(y) but I don't know why. — Micky, Sep 04 '15 at 05:21
Glad I've helped. If you understood how the process works you'll see that maybe if you have to deal with huge datasets, and many dependent/independent variables you might have to create an even larger dataset of the combinations. I don't know how the process will perform. Much slower? Memory limitations to store the new dataset? Might need to consider a loop, or a process that doesn't create a big dataset of combination. But, until then this process will do the job! — AntoniosK, Sep 04 '15 at 14:29
I was able to scale this to my dataset perfectly. The process takes seconds whereas my way would take days! Thanks again. — Micky, Sep 04 '15 at 15:01

score 1 · Answer 2 · answered Sep 02 '15 at 23:36

You can try something like:

vars <- expand.grid( Y = c('a1','a2','a3'), X=c('b1','b2','b3','b4'))
models_list <-  lapply( apply(vars, 1, 
                                 function(x) as.formula(paste(x[1], x[2], sep= "~") ) ),   
                        function(form) summary(lm(form=form, data= your_df) )
                        )

It's possible that substituting aov for lm may give you something more to your liking. See ?summary.aov and play with the examples there to see which of the components are needed for whatever your final purposes might be.

Fit regression model to multiple independent and dependent variables and obtain separate fits by grouping variable

2 Answers2