1

I have a data frame and I need to run 6 2-variable linear models for each group 'site'. Then, I need to convert the results to a data frame. The second variable in the linear model changes. I have that part down using lapply(), but I can't figure out how to run by groups. I've found answers on SO that answer parts of my question, but I can't figure out how to put it all together.

Here's some data:

structure(list(SiteName = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 
2L, 2L, 2L, 2L, 2L), .Label = c("bp10", "bp12"), class = "factor"), 
    DMWT = c(13.9697916666667, 13.9125, 14.2152083333333, 14.7810416666667, 
    15.1541666666667, 15.7535416666667, 17.3254166666667, 18.4872916666667, 
    20.0564583333333, 21.0595833333333, 21.3925), DMAT = c(16.6714631359947, 
    18.474493439025, 20.9517661662977, 23.7017661662978, 25.5957055602372, 
    20.9688840743375, 23.7188840743375, 25.6128234682769, 27.5143386197921, 
    27.6279749834285, 26.1355507410042), ADD = c(0, 0, 0, 1.90367965367967, 
    5.70129870129876, 0, 1.90367965367967, 5.70129870129876, 
    11.400432900433, 17.2132034632037, 21.53354978355), Air200 = c(7.3229782875097, 
    7.40616010569152, 7.50025101478243, 7.63384949963092, 7.78642525720668, 
    7.51736892282216, 7.65096740767065, 7.80354316524641, 7.97854316524641, 
    8.14729316524641, 8.29592952888278), Air100 = c(15.2711601056916, 
    15.362599499631, 15.512902529934, 15.727296469328, 15.9717661662977, 
    15.5300204379738, 15.7444143773677, 15.9888840743374, 16.2306264985798, 
    16.4472174076707, 16.6433537713071), Air75 = c(16.8986348531664, 
    17.0426752572068, 17.1927762673078, 17.3687358632674, 17.5567156612472, 
    17.2098941753475, 17.3858537713071, 17.5738335692869, 17.7820153874687, 
    18.0100961955496, 18.2532275086809), Air50 = c(19.5072207117523, 
    19.6340388935705, 19.7382813178129, 19.8887358632674, 20.1060085905402, 
    19.7553992258526, 19.9058537713072, 20.1231264985799, 20.4400961955496, 
    20.7669143773678, 20.9841871046405), Air10 = c(21.9214631359947, 
    21.5850994996311, 21.2563116208432, 21.1714631359947, 21.4502510147826, 
    21.2734295288829, 21.1885810440344, 21.4673689228223, 21.9696416500951, 
    22.3779749834284, 22.5476719531254)), .Names = c("SiteName", 
"DMWT", "DMAT", "ADD", "Air200", "Air100", "Air75", "Air50", 
"Air10"), row.names = c(547L, 548L, 549L, 550L, 551L, 1593L, 
1594L, 1595L, 1596L, 1597L, 1598L), class = "data.frame")

Here's the code for using each variable in a model. How do I use the sites?:

siteslist <- unique(d$SiteName) 
varlist <- names(d)[4:9]
models <- lapply(varlist, function(x) {  # apply the modeling function to our list of air variables
  lm(substitute(DMWT ~ DMAT + i, list(i = as.name(x))), data = d)  # linear model with air variable substituted
})

Then get model results & convert to a data frame:

library(relaimpo)
sumfun <- function(x) c(coef(x),
                        summary(x)$adj.r.squared,
                        sqrt(mean(resid(x)^2,na.rm=TRUE)),
                        calc.relimp(x,type="betasq")$betasq[1],
                        calc.relimp(x,type="betasq")$betasq[2],
                        calc.relimp(x,type="pratt")$pratt[1],
                        calc.relimp(x,type="pratt")$pratt[2])
mod.df <- as.data.frame(t(sapply(models,sumfun)))

Also tried combining variables and sites to do something like this:

siteslist <- unique(d$SiteName)                              
varlist <- names(d)[4:9]
sets <- expand.grid(SiteName = siteslist, Var = varlist)
models <- lapply(1:nrow(sets), function(x) {  # apply the modeling function to our list of air variables
  lm(substitute(DMWT ~ DMAT + i, list(i = as.name(sets$Var[x]))), data = d[d$SiteName ==  sets$SiteName[x],])  # linear model with air variable substituted
})

...but I get an error "Error in eval(expr, envir, enclos) : object '1' not found"

notacodr
  • 129
  • 2
  • 13
  • 1
    Take a subset of your data for one site only. Turn your working code into a function `fit_one_site` that depends on only the input `dat` and returns whatever you want (maybe a list with the models and the data frame? Maybe just the data frame?). Then you can `lapply(split(d, d$SiteName), FUN = fit_one_site)` to get your results. – Gregor Thomas Feb 24 '17 at 18:46
  • OK @Gregor. I think that works for the modeling part. Now, how can I use a version of my sumfun to pull out the model results? `mod.df <- as.data.frame(t(sapply(models,sumfun)))` gives me the error `$ operator is invalid for atomic vectors`. – notacodr Feb 24 '17 at 18:57
  • This is what @Gregor suggested: `fit_one_site <- function(x){ lapply(varlist, function(x) { lm(substitute(DMWT ~ DMAT + i, list(i = as.name(x))), data = d) }) } models <- lapply(split(d, d$SiteName), FUN = fit_one_site)` – notacodr Feb 24 '17 at 19:19
  • It's not clear from the question whether you want the output for a single site to include both the model objects *and* the `sumfun` result, or just the `sumfun` result. Either way, make the `fit_one_site` function return what you want. It can `return(lapply(model_list, sumfun))` or it can `return(list(model_list, mod.df = lapply(model_list, sumfun)))`. – Gregor Thomas Feb 24 '17 at 20:45
  • Just the `sumfun` result is fine. Forgive my daftness, but I'm rather new to function writing syntax. Could you show where exact your `return...` line would go in the `fit_one_site` function? Where does `model_list` come from? – notacodr Feb 24 '17 at 21:22

2 Answers2

1

I am not sure if this is exactly what you are trying to do, but the data.table plyr package allows you to run models split by multiple variables. Below is an example, with var1 and var2 simply representing two variables you want each combination of values to be modeled separately.

#load packages
library(data.table)
library(plyr)

#break up by variables, then fit the model to each piece
models <- dlply(data, c("var1","var2"),
              function(data)
                lm(DV ~ 
                     IV1 + IV2
                   , data = data, weights = weights))
#apply coef to eah model and return a df
models_coef <- ldply(models, coef)
#print summary
l_ply(models_coef, summary, .print = T)
juliamm2011
  • 136
  • 3
  • I need to split a data frame by groups, then on each group run models that would substitute in a different variable in the `"IV2"` position. Don't think this will tackle the group problem. – notacodr Feb 24 '17 at 20:12
1

This is how I would do it. Note this is untested as I haven't installed relaimpo. I'm really just re-packaging your code.

The general method is 1. develop a function that works on one group 2. use split to divide your data into groups 3. use lapply to apply the function to each group 4. (if needed) combine the results together

First test on a one-site subset of data:

The only changes I made are (a) to pull out a subset of data for one site and name it one_site. (b) to use one_site in your modeling code. (c) I prefer pasting a formula together as a string to using substitute, so I made that change. (d) White space and formatting for readability (mostly using RStudio's "reformat code").

## set up
varlist <- names(d)[4:9]
library(relaimpo)
sumfun <- function(x) {
    c(
        coef(x),
        summary(x)$adj.r.squared,
        sqrt(mean(resid(x) ^ 2, na.rm = TRUE)),
        calc.relimp(x, type = "betasq")$betasq[1],
        calc.relimp(x, type = "betasq")$betasq[2],
        calc.relimp(x, type = "pratt")$pratt[1],
        calc.relimp(x, type = "pratt")$pratt[2]
    )
}

## Testing: this works for one_site
one_site <- subset(d, SiteName == "bp10")

models <- lapply(varlist, function(x) {  # apply the modeling function to our list of air variables
    form <- as.formula(sprintf("DMWT ~ DMAT + %s", x))
    lm(form, data = one_site)  # linear model with air variable substituted
})

## desired result
mod.df <- as.data.frame(t(sapply(models, sumfun)))

Turn it into a function

Once you have code that works for a single site, we turn it into a function. The only inputs seem to be the data for one site and the variables in varlist. Instead of assigning the result at the bottom, we return it:

fit_one_site = function(one_site, varlist) {
    models <- lapply(varlist, function(x) {
            # apply the modeling function to our list of air variables
            form = as.formula(sprintf("DMWT ~ DMAT + %s", x))
            lm(form, data = one_site)  # linear model with air variable substituted
    })
    return(as.data.frame(t(sapply(models, sumfun))))
}

Now we can use split to split your data up by SiteName, and lapply to apply the fit_one_site function to each piece.

results = lapply(split(d, d$SiteName), FUN = fit_one_site, varlist = names(d)[4:9])

The results should be list of data frames, one for each site. If you want to combine them into one data frame, see the relevant part of my answer at the list of data frames R-FAQ.

Community
  • 1
  • 1
Gregor Thomas
  • 136,190
  • 20
  • 167
  • 294