How to get mean for all participants after selecting only a certain number of trials

Question

I have a dataset of 500 trials per participant. I want to sample the same number of trials from each participant, for several different sample sizes, and then compute the mean for each participant. Instead, my code creates a separate file with a single mean for each participant and each "num", e.g. if the mean for participant 1 with 125 trials is 426, that single value is the whole file, then there is another file for participant 1 with 150 trials containing a single value, and so on for all participants. I was aiming for a single file for 125 with the means for all participants, then another file with the means for 150, etc.

num <- c(125,150,175,200,225,250,275,300,325,350,375,400)

Subset2 <- list()


for (x in 1:12){
  for (j in num){
   Subset2[[x]] <- improb2 %>% group_by(Participant) %>% sample_n(j) %>% summarise(mean = mean(RT))
  
  
}}

Here is a reproducible example:

RT <- sample(200:600, 10000, replace=T)
df <- data.frame(Participant= letters[1:20]) 
df <- as.data.frame(df[rep(seq_len(nrow(df)), each = 500),])

improb2 <- cbind(RT, df)
improb2 <- improb2 %>% rename(Participant = `df[rep(seq_len(nrow(df)), each = 500), ]`)

One of the desired dataframes in subset2 would be something like:

Subset2[[1]]

Participant  mean
   <chr>       <dbl>
 1 P001         475.
 2 P002         403.
 3 P003         481.
 4 P004         393.
 5 P005         376.
 6 P006         402.
 7 P007         497.
 8 P008         372.
 9 P010         341.

Your `Subset2` is a `list` of summarised output. Do you want `sapply(Subset2, function(x) mean(x$mean))`? — akrun, Jan 14 '21 at 18:08
I want this bit "improb2 %>% group_by(Participant) %>% sample_n(j) %>% summarise(mean = mean(RT))" to generate a table with the means for all participants for a certain value of num (e.g. 150) and subset2 to be a list of all this tables. — CatM, Jan 14 '21 at 19:36
In your code output, I get a list of data.frame with each data.frame showing the `mean` values for different Participant — akrun, Jan 14 '21 at 19:38
I added the bit of what I wish the outcome looked like, maybe that helps — CatM, Jan 14 '21 at 19:42
No, there are 45 participants, but 12 values in `num`, so I would expect 12 tables, each with the output for all participants at one value of `num`. In the reproducible example there were 20 participants, I think. — CatM, Jan 14 '21 at 19:45

LMc · Accepted Answer · 2021-01-14T20:43:05.213


This answer uses the tidyverse and returns a list object, data, whose element names are the sample sizes. Because those names are numbers, you have to use backticks to access each sample size summary, e.g. data$`125`. Each element is a tibble. I added a comment in the function showing where you can convert it to a data.frame object if you need to.

library(tidyverse)

num <- c(125, 150, 175, 200, 225, 250, 275, 300, 325, 350, 375, 400)

# create function to sample data by certain size and summarize by mean
get_mean <- function(x, n) { 
  dplyr::group_by(x, Participant) %>% # group by participant
    dplyr::sample_n(n) %>% # randomly sample observations
    dplyr::summarize(mean = mean(RT), # get mean of RT
                     n = n(), # get sample size
                     .groups = "keep") %>% 
    dplyr::ungroup()
# add a pipe to as.data.frame if you don't want a tibble object
}

# create a list object where the names are the sample sizes
data <- lapply(setNames(num, num), function(sample_size) {get_mean(df, n = sample_size)})

head(data$`125`)

 Participant  mean     n
  <chr>       <dbl> <int>
1 V1           20.2   125
2 V10          19.9   125
3 V11          19.8   125
4 V12          20.2   125
5 V2           20.5   125
6 V3           20.0   125

Data

I wasn't 100% sure what your dataset looked like, but I believe it looks something like this:

# create fake data for 45 participants with 500 obs per participant
df <- replicate(45, rnorm(500, 20, 4)) %>%
  as.data.frame.matrix() %>% 
  tidyr::pivot_longer(everything(), 
                      names_to = "Participant", # id column
                      values_to = "RT") %>% # value column
  dplyr::arrange(Participant)


head(df) # Participant repeated 500 times, with 500 values in RT
 Participant    RT
  <chr>       <dbl>
1 V1           24.7
2 V1           15.2
3 V1           21.1
4 V1           21.6
5 V1           20.3
6 V1           25.6

If this is a similar structure (long with repeated participant IDs and a single column RT of values) then the above should work.
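
As a rough check of that claim, here is a minimal sketch that rebuilds data shaped like the question's reproducible example (20 participants labelled with letters, 500 RT values each) and applies the same `lapply` pattern. The `set.seed(1)` call and the name `means_by_size` are assumptions added for illustration and are not part of the original post.

```r
library(dplyr)

set.seed(1)  # assumption: fixed seed so the random sampling is repeatable

# long-format data: 20 participants x 500 trials, as in the question's example
improb2 <- data.frame(
  Participant = rep(letters[1:20], each = 500),
  RT = sample(200:600, 10000, replace = TRUE)
)

num <- c(125, 150, 175, 200, 225, 250, 275, 300, 325, 350, 375, 400)

# one summary table per sample size; list names are the sample sizes
means_by_size <- lapply(setNames(num, num), function(sample_size) {
  improb2 %>%
    group_by(Participant) %>%
    sample_n(sample_size) %>%   # draw sample_size trials per participant
    summarise(mean = mean(RT), n = n(), .groups = "drop")
})

length(means_by_size)         # one list element per value in num
nrow(means_by_size[["125"]])  # one row per participant
```

Each list element should then hold one row per participant, which is the layout the question asks for.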

edited Jan 14 '21 at 20:43

answered Jan 14 '21 at 20:37

LMc


This is great, exactly what I need! I had a reproducible example in the post but you guess perfectly! Thank you so much! – CatM Jan 14 '21 at 20:58
Great, happy to help. To fix your original loop, you have a nested loop when you don't need it as @akrun suggested. Delete the inner loop for `j` and change your assignment line to `Subset2[[x]] <- improb2 %>% group_by(Participant) %>% sample_n(num[x]) %>% summarise(mean = mean(RT))`. Notice the change to the `sample_n` function from your OP. `x` is the index of each sample stored in `num`. – LMc Jan 14 '21 at 21:10
Do you know the best way to ensure that the code is reproducible? Is it enough to set the seed outside the `lapply` function? – CatM Jan 14 '21 at 21:14
You'll get better the more questions you ask. I would have put the `RT` vector as a column in the dataframe along with the letters you used as an example of participant IDs. Also provide general column names (see the column name you provided on `df`). Here is a [link](https://stackoverflow.com/questions/49994249/example-of-using-dput) about using `dput` for asking SO questions that's very helpful. SO also has a [help page](https://stackoverflow.com/help/minimal-reproducible-example) on producing minimal reproducible examples. You showed a good attempt, so overall good question. – LMc Jan 14 '21 at 21:24

Yes, setting the seed outside the `lapply` should create the same results each time you run everything. For example when I run this code multiple times I get the exact same output: `set.seed(1); lapply(1:5, function(x) rnorm(x))` – LMc Jan 14 '21 at 21:29
I have only now had the chance to take another proper look at the code, and I was doing something I think is not a good idea: I was bootstrapping 500 trials and then using the function you came up with to sample the number of trials I needed from the bootstrapped data. Now I think I should just use your function to bootstrap the right number directly. Would that be a better approach? If so, how would I go about changing the function (get_mean)? I was using sample(x) to bootstrap. – CatM Jan 25 '21 at 17:18
Fix the bootstrap sample size by changing `sample_n()`. Then change `lapply(1:, get_mean)`. Note: `get_mean` is no longer a function of anything since you're fixing the sample size. Your output will then be a list where each element is a bootstrapped sample summary. The length of your list will be the number of bootstraps. Overall, though, I don't know that this is the most efficient way to store your data. If your number of bootstraps isn't terribly large it will probably be fine. – LMc Jan 25 '21 at 19:05
I am not sure I understood what you said, could you please answer here: https://stackoverflow.com/questions/65889687/bootstrapping-responses-per-participant – CatM Jan 25 '21 at 22:04
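
The loop fix LMc describes in the comments above can be sketched as follows. This is a minimal sketch, assuming data shaped like the question's reproducible example; the rebuilt `improb2`, the `set.seed(1)` call, and the optional naming step at the end are illustrative assumptions rather than part of the original thread.

```r
library(dplyr)

set.seed(1)  # assumption: fixed seed so repeated runs give the same samples

# long-format data: 20 participants x 500 trials, as in the question's example
improb2 <- data.frame(
  Participant = rep(letters[1:20], each = 500),
  RT = sample(200:600, 10000, replace = TRUE)
)

num <- c(125, 150, 175, 200, 225, 250, 275, 300, 325, 350, 375, 400)

Subset2 <- vector("list", length(num))

# single loop: x indexes num, so each list slot gets exactly one sample size
for (x in seq_along(num)) {
  Subset2[[x]] <- improb2 %>%
    group_by(Participant) %>%
    sample_n(num[x]) %>%          # num[x] trials per participant
    summarise(mean = mean(RT))
}

names(Subset2) <- num  # optional: label each table by its sample size
```

In the original nested loop, the inner `for (j in num)` reassigned `Subset2[[x]]` on every pass, so each slot ended up holding only the result for the last value of `num`. Indexing `num` with `x` removes that overwrite.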
