Applying user defined functions to a specific column of each dataframe in a list of dataframes?

Question

Background

Hi everyone! I am currently working on a project which requires me to estimate within-study variance by bootstrapping model residuals and then calculating the SEE for each sample. This process has to be done on a model-by-model basis.

I have started off by creating a list of dataframes, split on the factor variable model using list.meta<- split(new.meta, new.meta$model), so that each dataframe contains the data pertaining to a single model. I have supplied a reproducible example below, limited to 3 models; my full dataset contains 13. From there I have two user defined functions: one for calculating the SEE, and another which generates 1000 bootstrapped samples and calculates the SEE for each using the previously defined SEE function. I have supplied both below for transparency.

User defined functions

#Define SEE function 
SEE<- function(x){
  sqrt((sum(x)/(length(x)-2))^2)
}

#Define function for generating bootstrap samples and calculating SEE for each sample

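Bootstrap<- function(x){
  # Draw 1000 resamples of x with replacement
  int<- lapply(1:1000, function(i) sample(x, replace = TRUE))
  # Compute the SEE of each resample and return the vector of 1000 values
  sapply(int, SEE)
}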

Where x is the Residuals column from a given dataframe 'i'

Data

list(`1` = structure(list(Study = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), Model = c(1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
), Residuals = c(26.96774194, 24.35483871, 15.74193548, 15.70967742, 
13.22580645, 12.87096774, 11.77419355, 10.67741935, 10.58064516, 
8.548387097, 8, 5.548387097, 5.35483871, 5.322580645, 2.612903226, 
1.483870968, 1.225806452, 0.258064516)), row.names = c(NA, 18L
), class = "data.frame"), `2` = structure(list(Study = c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
), Model = c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L), Residuals = c(20.19354839, 16.5483871, 15.74193548, 
14.61290323, 7.064516129, 6.580645161, 5.64516129, 4.580645161, 
4.612903226, 3.612903226, 3.35483871, 2.741935484, 2.419354839, 
1.64516129, 1.35483871, 1.903225806, 0.516129032)), row.names = 19:35, class = "data.frame"), 
    `3` = structure(list(Study = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), Model = c(3L, 3L, 
    3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L
    ), Residuals = c(23.80645161, 17.41935484, 15.58064516, 13.22580645, 
    11.32258065, 10.4516129, 6.709677419, 6.193548387, 5.741935484, 
    4.870967742, 4.322580645, 2.709677419, 2.677419355, 1.032258065, 
    1.129032258, 0.451612903, 1.064516129)), row.names = 36:52, class = "data.frame"))

Problem/question

So, here's my problem: I need to apply the bootstrap function to the residuals column of each model, with the output eventually being a list of length 13 (where each element of the list is a vector comprised of 1000 SEE values) or as a dataframe/matrix with 13 columns and 1000 rows (the second is preferable as it will be used for further analyses and the package takes as input a dataframe).

I imagine one of the best ways to do this would be either through a for loop or through one of the functions from the apply family. However, as far as syntax goes, I have no idea how to actually apply the function to a specific column of each dataframe when these are nested in a list.

What I have tried

Attempt one was to use the lapply function.

dat<- lapply(na.omit(new.data[[i]][, 4]), Bootstrap)

The [[i]][, 4] is my attempt at telling R to use the data from the fourth column of the ith element of the list. This partially worked but returned a list of length 18? Some of the list elements also made no sense.

The second option I'm working on is to use a for loop:

for (i in 1:seq_along(new.data)){
result<- Bootstrap(new.data[[i]][,4])
return(result)
}

but this returns an error

In 1:seq_along(new.data) :
  numerical expression has 13 elements: only the first used

I also have no idea how to actually save the results into list or matrix format and my for loop skills could use more work... so there's that.

There's probably going to be a very simple answer to this, so thank you in advance for any and all suggestions. I really need to up my hours practicing coding :)

score 0 · Accepted Answer · answered Jun 22 '21 at 14:43

You can do

dat <- lapply(new.data, function(dataFrameInList) {
    Bootstrap(na.omit(dataFrameInList[["Residuals"]]))
})

I hope the naming is clear. When you use lapply over a list, it takes each element in turn, in your case the data.frames, which I called dataFrameInList as the "loop variable". Then choose the residuals via dataFrameInList[["Residuals"]]. Alternatively you could use dataFrameInList[,"Residuals"] or dataFrameInList[,4]. Throw away the NAs and finally apply your Bootstrap function.
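As a minimal sketch of the call above on a small made-up list: the toy data, the `n_boot` parameter (added only so the demo can use 5 replicates instead of 1000), and the names `toy`, `res_list`, `res_mat` and `res_df` are illustrative assumptions, not the asker's data. Swapping `lapply` for `sapply` simplifies the result to a matrix with one column per model, which can then be turned into a data frame.

```r
# SEE as defined in the question.
SEE <- function(x) {
  sqrt((sum(x) / (length(x) - 2))^2)
}

# Bootstrap as in the question, with the replicate count exposed as
# n_boot (an assumption made here so the demo stays small).
Bootstrap <- function(x, n_boot = 1000) {
  int <- lapply(seq_len(n_boot), function(i) sample(x, replace = TRUE))
  sapply(int, SEE)
}

# Toy stand-in for the split list of data frames (hypothetical values).
toy <- list(
  `1` = data.frame(Model = 1L, Residuals = c(2.5, 1.0, NA, 4.2, 3.1)),
  `2` = data.frame(Model = 2L, Residuals = c(0.7, 5.5, 2.2, 1.9, 3.3))
)

set.seed(1)  # only to make the demo reproducible

# lapply: one list element per data frame, each a vector of n_boot SEE values.
res_list <- lapply(toy, function(df) {
  Bootstrap(na.omit(df[["Residuals"]]), n_boot = 5)
})

# sapply: same call, simplified to a matrix
# (rows = bootstrap replicates, columns = models).
res_mat <- sapply(toy, function(df) {
  Bootstrap(na.omit(df[["Residuals"]]), n_boot = 5)
})

# A data frame, in case a downstream package requires one.
res_df <- as.data.frame(res_mat)
```

With the full data and the default of 1000 replicates, the same `sapply` call should give a 1000-row matrix with one column per model, provided every model returns a vector of the same length.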

Jonas

This worked perfectly, thank you so much! I decided to change `lapply` to `sapply` as well to get the data in matrix format. Your explanation and naming were very clear, but there are just a couple of things I don't get: 1. Why does this work but indexing with `[[i]][, 4]` does not? I get that it iterates over each element, which is a `data.frame`. 2. I get that the function is a so-called "anonymous" function, but how does it work? Could you explain a little more, as I think this will help my thought process for future problems. Again, thank you greatly. – Cheddar_Thought Jun 22 '21 at 14:59
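A sketch of what is probably going on, using made-up data (the names `toy`, `per_value` and `count_resid` are illustrative). `lapply(X, FUN)` calls `FUN` once per element of `X`. A data frame column is a plain vector, so `lapply(new.data[[i]][, 4], Bootstrap)` calls `Bootstrap` once per residual, giving one list element per row (18 for an 18-row data frame), and it needs `i` to exist already because `lapply` never defines it. In addition, `sample(x)` on a single number `x >= 1` samples from `1:x` rather than from `x` itself, which may explain the elements that made no sense. An anonymous function is simply a function that was never given a name; the three versions below do the same thing.

```r
# Toy list of data frames (hypothetical values).
toy <- list(
  a = data.frame(Residuals = c(3, 1, 2)),
  b = data.frame(Residuals = c(10, 20, 30, 40))
)

# Iterating over a column visits each number separately, so the result
# has one element per row, not one per data frame.
per_value <- lapply(toy[[1]][, "Residuals"], function(x) length(x))
length(per_value)  # equals nrow(toy[[1]])

# Iterating over the list visits each data frame once.
# 1. Anonymous function, defined inline:
n_anon <- lapply(toy, function(df) length(df[["Residuals"]]))

# 2. The same function, given a name first:
count_resid <- function(df) length(df[["Residuals"]])
n_named <- lapply(toy, count_resid)

# 3. The explicit loop that lapply performs for you:
n_loop <- vector("list", length(toy))  # preallocate the result list
names(n_loop) <- names(toy)
for (i in seq_along(toy)) {            # seq_along(), not 1:seq_along()
  n_loop[[i]] <- count_resid(toy[[i]])
}
```

The loop version also shows how to store results from a `for` loop: create the container first, then assign into `[[i]]` on each pass. `return()` is only meaningful inside a function, so it is not used in the loop body.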
Maybe https://stackoverflow.com/questions/3505701/grouping-functions-tapply-by-aggregate-and-the-apply-family helps you. – Jonas Jun 22 '21 at 15:19

Thank you again for all of your help, Jonas. I have now been able to streamline the process into one single chunk of code using `tapply` and anonymous functions at each stage of the process. – Cheddar_Thought Jun 23 '21 at 15:37
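The final `tapply` code is not shown in the thread, so the following is only a guess at what such a version might look like. It works on the unsplit data frame and groups the residuals by model in one step. The data, the `n_boot` parameter, and the names `toy_long`, `boot_by_model` and `boot_mat` are hypothetical.

```r
# SEE as defined in the question.
SEE <- function(x) {
  sqrt((sum(x) / (length(x) - 2))^2)
}

# Bootstrap as in the question, with the replicate count exposed as
# n_boot (an assumption made here so the demo stays small).
Bootstrap <- function(x, n_boot = 1000) {
  int <- lapply(seq_len(n_boot), function(i) sample(x, replace = TRUE))
  sapply(int, SEE)
}

# Toy stand-in for the unsplit data (hypothetical values).
toy_long <- data.frame(
  Model     = rep(1:2, each = 5),
  Residuals = c(2.5, 1.0, 3.7, 4.2, 3.1, 0.7, 5.5, 2.2, 1.9, 3.3)
)

set.seed(1)  # only to make the demo reproducible

# tapply splits Residuals by Model and applies the anonymous function
# to each group; the result holds one vector of SEE values per model.
boot_by_model <- tapply(
  toy_long$Residuals,
  toy_long$Model,
  function(r) Bootstrap(na.omit(r), n_boot = 5)
)

# Bind the per-model vectors into a matrix
# (rows = bootstrap replicates, columns = models).
boot_mat <- do.call(cbind, boot_by_model)
```

This avoids the separate `split()` step, at the cost of a result that has to be bound into a matrix by hand.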
