How to us lapply or sapply for GLM on multiple species separately?

Question

I am trying to run a GLM on multiple different species in my data set. Currently I have been sub-setting my data for each species and copying this code and it's turned into quite the mess. I know there has to be a better way to do this, (maybe with the lapply function?) but I'm not sure how to begin with that.

I'm running the model on the CPUE (catch per unit effort) for a species and using Year, Salinity, Discharge, and Rainfall as my explanatory variables.

My data is here: https://drive.google.com/file/d/1_ylbMoqevvsuucwZn2VMA_KMNaykDItk/view?usp=sharing

This is the code that I have tried. It gets the job done, but I have just been copying this code and changing the species each time. I'm hoping to find a way to simplify this process and clean up my code a bit.

fish_df$pinfishCPUE <- ifelse(fish_df$Commonname == "Pinfish", fish_all$CPUE, 0)
#create binomial column
fish_df$binom <- ifelse(fish_df$pinfishCPUE > 0, 1,0)


glm.full.bin = glm(binom~Year+Salinity+Discharge +Rainfall,data=fish_df,family=binomial)
glm.base.bin = glm(binom~Year,data=fish_df,family=binomial)

#step to simplify model and get appropriate order
glm.step.bin = step(glm.base.bin,scope=list(upper=glm.full.bin,lower=~Year),direction='forward',
                    trace=1,k=log(nrow(fish_df)))

#final model - may choose to reduce based on deviance and cutoff in above step
glm.final.bin  = glm.step.bin
print(summary(glm.final.bin))

#calculate the LSMeans for the proportion of positive trips
lsm.b.glm = emmeans(glm.final.bin,"Year",data=fish_df)
LSMeansProp = summary(lsm.b.glm)

Output:

Call:
glm(formula = log.CPUE ~ Month + Salinity + Temperature, family = gaussian, 
    data = fish_B_pos)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.8927  -0.7852   0.1038   0.8974   3.5887  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.38530    0.72009   3.313  0.00098 ***
Month        0.10333    0.03433   3.010  0.00272 ** 
Salinity    -0.13530    0.01241 -10.900  < 2e-16 ***
Temperature  0.06901    0.01434   4.811  1.9e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 1.679401)

    Null deviance: 1286.4  on 603  degrees of freedom
Residual deviance: 1007.6  on 600  degrees of freedom
AIC: 2033.2

Number of Fisher Scoring iterations: 2

It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. Data should be included in the question rather than linked to at an external site. Perhaps make it clear what needs to change in the loop and make it clear what exactly you need to store from each iteration. — MrFlick, Aug 20 '20 at 21:51

score 0 · Accepted Answer · answered Aug 20 '20 at 21:56

I would suggest next approach creating a function for the models and then using lapply over a list which results from applying split() to the dataframe by variable Commonname:

library(emmeans)
#Load data
fish_df <- read.csv('fish_df.csv',stringsAsFactors = F)
#Code
List <- split(fish_df,fish_df$Commonname)
#Function for models
mymodelfun <- function(x)
{
  #Create binomial column
  x$binom <- ifelse(x$pinfishCPUE > 0, 1,0)
  
  
  glm.full.bin = glm(binom~Year+Salinity+Discharge +Rainfall,data=x,family=binomial)
  glm.base.bin = glm(binom~Year,data=x,family=binomial)
  
  #step to simplify model and get appropriate order
  glm.step.bin = step(glm.base.bin,scope=list(upper=glm.full.bin,lower=~Year),direction='forward',
                      trace=1,k=log(nrow(x)))
  
  #final model - may choose to reduce based on deviance and cutoff in above step
  glm.final.bin  = glm.step.bin
  print(summary(glm.final.bin))
  
  #calculate the LSMeans for the proportion of positive trips
  lsm.b.glm = emmeans(glm.final.bin,"Year",data=x)
  LSMeansProp = summary(lsm.b.glm)
  return(LSMeansProp)
}
#Apply function
Lmods <- lapply(List,mymodelfun)

In Lmods there will be the results of the models, here an example:

Lmods$`Atlantic Stingray`

Output:

 Year emmean    SE  df asymp.LCL asymp.UCL
 2009  -22.6 48196 Inf    -94485     94440

Results are given on the logit (not the response) scale. 
Confidence level used: 0.95

How to us lapply or sapply for GLM on multiple species separately?

1 Answers1