Create a new variable that is the average of one variable conditional on two other variables (and maintain all other variables in the data set)

Question

Here is a (shortened) sample from a data set I am working on. The sample represents data from an experiment with 2 sessions (session_number); in each session, participants completed 5 trials (trial_number) of a hand grip exercise (so 10 trials in total per participant; 2 * 5 = 10). Each trial has 3 observations of hand grip strength (percent_of_maximum). I want to get the average (below, I call it mean_by_trial) of these 3 observations for each of the 10 trials.

Finally, and this is what I am stuck on, I want to output a data set that is 20 rows long (one row for each unique trial, there are 2 participants and 10 trials for each participant; 2 * 10 = 20), AND retains all other variables. All the other variables (in the example there are: placebo, support, personality, and perceived_difficulty) will be the same for each unique Participant, trial_number, or session_number (see sample data set below).

I have tried this using ddply, which is pretty much what I want, but the new data set does not contain the other variables in the data set (new_dat only contains trial_number, session_number, Participant and the new mean_by_trial variable). How can I maintain the other variables?

#create sample data frame
dat <- data.frame(
  Participant = rep(1:2, each = 30),
  placebo = c(replicate(15, "placebo"), replicate(15, "control"), replicate(15, "control"), replicate(15, "placebo")),
  support = rep(sort(rep(c("support", "control"), 3)), 10),
  personality = c(replicate(30, "nice"), replicate(30, "naughty")),
  session_number = c(rep(1:2, each = 15), rep(1:2, each = 15)),
  trial_number = c(rep(1:5, each = 3), rep(1:5, each = 3), rep(1:5, each = 3), rep(1:5, each = 3)),
  percent_of_maximum = runif(60, min = 0, max = 100),
  perceived_difficulty = runif(60, min = 50, max = 100)
)

#this is what I have tried so far
library(plyr)
new_dat <- ddply(dat, .(trial_number, session_number, Participant), summarise, mean_by_trial = mean(percent_of_maximum), .drop = FALSE)

I want new_dat to contain all the variables in dat, plus the mean_by_trial variable. Thank you!

score 2 · Answer 1 · answered Mar 26 '19 at 11:52

We can use mutate instead of summarise to add the mean as a new column while keeping all the other columns, and then use slice(1) to keep a single row per group.

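library(plyr)   # provides ddply() and .(); load it before dplyr
library(dplyr)
# add the per-trial mean to every row of its group
out <- ddply(dat, .(trial_number, session_number, Participant), 
   plyr::mutate, mean_by_trial = mean(percent_of_maximum), .drop = FALSE)
# keep the first row of each trial
out %>%
       group_by(trial_number, session_number, Participant) %>%
       slice(1)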

If we use dplyr, then this can all be inside a chain

newdat <- dat %>% 
            group_by(trial_number, session_number, Participant) %>%
            mutate(mean_by_trial = mean(percent_of_maximum)) %>%
            slice(1)
head(newdat)
# A tibble: 6 x 9
# Groups:   trial_number, session_number, Participant [6]
# Participant placebo support personality session_number trial_number percent_of_maximum perceived_difficulty mean_by_trial
#        <int> <fct>   <fct>   <fct>                <int>        <int>              <dbl>                <dbl>         <dbl>
#1           1 placebo control nice                     1            1               71.5                 95.5          73.9
#2           2 control control naughty                  1            1               38.9                 63.8          67.7
#3           1 control support nice                     2            1               97.1                 54.2          68.4
#4           2 placebo support naughty                  2            1               62.9                 86.2          40.4
#5           1 placebo support nice                     1            2               49.0                 95.8          65.7
#6           2 control support naughty                  1            2               80.9                 74.6          68.3
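
Note that slice(1) keeps the first row of each group, so any column that varies within a trial ends up holding only the first of its three values. In the sample data that applies to percent_of_maximum and also to perceived_difficulty, since both are generated with runif. Below is a minimal sketch for spotting such columns, assuming dplyr >= 1.0.0 (for across() and .groups). The toy data frame and the names toy, varies, key_cols and non_constant are invented for illustration and are not part of the question's data.

```r
library(dplyr)

# Invented toy data: 1 participant, 1 session, 2 trials, 3 observations each.
toy <- data.frame(
  Participant = 1L,
  session_number = 1L,
  trial_number = rep(1:2, each = 3),
  personality = "nice",
  percent_of_maximum = c(10, 20, 30, 40, 50, 60)
)

# Count the distinct values of every non-grouping column within each trial.
varies <- toy %>%
  group_by(Participant, session_number, trial_number) %>%
  summarise(across(everything(), n_distinct), .groups = "drop")

# Columns with more than one distinct value in some trial are not constant.
key_cols <- c("Participant", "session_number", "trial_number")
value_cols <- setdiff(names(varies), key_cols)
non_constant <- names(which(sapply(varies[value_cols], function(x) any(x > 1))))
non_constant
```

Any column listed in non_constant could be dropped, or summarised itself, before taking one row per trial.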

score 1 · Accepted Answer · answered Mar 26 '19 at 11:57

Here’s a tidyverse answer. First you want to group_by the variables of interest. Then calculate the desired mean in a new column using mutate.

As the value in the new mean column will be repeated on every row within a group, use the distinct function to retain unique rows. In other words, select a single row for each combination of Participant, session_number, and trial_number.

This is the answer (https://stackoverflow.com/a/39092166/9941764) provided in: R - dplyr Summarize and Retain Other Columns

new_dat <- dat %>%
    group_by(Participant, session_number, trial_number) %>%
    mutate(mean = mean(percent_of_maximum)) %>% 
    distinct(mean, .keep_all = TRUE)
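
For comparison, here is a hedged sketch of a different route (not the one used in the answers above): collapse each trial with summarise() and carry the constant columns along with first(). It assumes dplyr >= 1.0.0 and that the carried columns really are constant within a trial. The names toy and collapsed are illustrative, and the toy data is invented.

```r
library(dplyr)

# Invented toy data: 1 participant, 1 session, 2 trials, 3 observations each.
toy <- data.frame(
  Participant = 1L,
  session_number = 1L,
  trial_number = rep(1:2, each = 3),
  placebo = "placebo",
  personality = "nice",
  percent_of_maximum = c(10, 20, 30, 40, 50, 60)
)

collapsed <- toy %>%
  group_by(Participant, session_number, trial_number) %>%
  summarise(
    # carry along the columns assumed to be constant within a trial
    across(c(placebo, personality), first),
    # average the three raw observations of the trial
    mean_by_trial = mean(percent_of_maximum),
    .groups = "drop"
  )
collapsed
```

This returns one row per trial and leaves out the raw percent_of_maximum column, so no arbitrary single observation is kept next to the mean.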
