
I have a large data frame and want to standardise multiple columns, computing the mean and standard deviation only from rows that satisfy a condition. Say I have the following example data:

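library(dplyr)  # provides %>%, group_by, mutate and mutate_at

set.seed(123)
df = data.frame("sample" = c(rep(1:2, each = 5)),
       "status" = c(0,1),  # recycled to 0,1,0,1,... across the 10 rows
       "s1" = runif(10, -1, 1),
       "s2" = runif(10, -5, 5),
       "s3" = runif(10, -25, 25))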

I want to standardise each of s1-s3, computing the mean and standard deviation only from rows where status==0. If I had to do this for, say, s1 only, I could do the following:

df = df %>% group_by(sample) %>%
  mutate(sd_s1 = (s1 - mean(s1[status==0])) / sd(s1[status==0]))

But my problem arises when I have to perform this operation on multiple columns. I tried writing a function to include with mutate_at:

standardize <- function(x) {
    return((x - mean(x[status==0]))/sd(x[status==0]))
}

df = df %>% group_by(sample) %>% 
  mutate_at(vars(s1:s3), standardize)

This just produces NA values for s1-s3. I have tried to use the answer provided in "R - dplyr - mutate - use dynamic variable names", but cannot figure out how to do the subsetting.

Any help is greatly appreciated. Thanks!

J. Debost

1 Answer


We could just use

df %>%
  group_by(sample) %>% 
  mutate_at(vars(s1:s3), funs((.- mean(.[status == 0]))/sd(.[status == 0])))
akrun
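For readers on newer dplyr releases: funs() was deprecated in dplyr 0.8.0, and across() (available from dplyr 1.0.0) is the current way to apply one function to several columns. The following is only a minimal sketch of the same idea, assuming dplyr >= 1.0.0 and the example data from the question. Here status is written out with rep() instead of relying on recycling, and the name res is purely illustrative.

```r
library(dplyr)

# Example data from the question, with status spelled out explicitly
set.seed(123)
df <- data.frame(sample = rep(1:2, each = 5),
                 status = rep(c(0, 1), times = 5),
                 s1 = runif(10, -1, 1),
                 s2 = runif(10, -5, 5),
                 s3 = runif(10, -25, 25))

# Within each sample, centre and scale s1-s3 using only the status == 0 rows.
# Inside the lambda, .x is the current column and status is the column of the
# current group.
res <- df %>%
  group_by(sample) %>%
  mutate(across(s1:s3, ~ (.x - mean(.x[status == 0])) / sd(.x[status == 0]))) %>%
  ungroup()

res
```

After this, the status == 0 rows of each sample should have mean 0 and standard deviation 1 in every standardised column, which is a quick way to check the result.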
  • Thanks for this great answer. Is there a way to create multiple new columns, such as `sd_s1, sd_s2, ...`, while keeping the original columns, such as `s1, s2, ...`, using the `mutate_at` function? – www Sep 06 '17 at 14:45
  • @ycw Thanks. You can use `df %>% group_by(sample) %>% mutate_at(vars(s1:s3), funs(sd = (.- mean(.[status == 0]))/sd(.[status == 0])))` – akrun Sep 06 '17 at 14:47
  • This is great. Thanks again. – www Sep 06 '17 at 14:50
  • @J.Debost Your code is not working for me. `df[,3:ncol(df)] %>% mutate_all(standardize)# Error in mutate_impl(.data, dots) : Evaluation error: object 'status' not found.` Which version of dplyr do you have? I am using 0.7.2. The `.` refers back to the individual columns specified in `vars`. – akrun Sep 07 '17 at 04:50
  • I'm so sorry for the confusion! Your code is running just the way it's supposed to, standardising the specified columns by group using only mean and sd in controls (status == 0). Simple and efficient! Thank you very much for your help. I have deleted my two earlier comments. – J. Debost Sep 07 '17 at 07:43
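The two points raised in these comments (keeping the original columns, and the 'status' not found error) can also be sketched with a standalone helper. This assumes a recent dplyr that has across() and the {.col} placeholder in .names. standardize_by and its ref argument are hypothetical names, not part of dplyr. Passing status == 0 in as an argument means the helper never has to look up status itself, which is likely what went wrong with standardize() in the question, since a function defined outside the pipeline does not see the data frame's columns.

```r
library(dplyr)

# Example data from the question, with status spelled out explicitly
set.seed(123)
df <- data.frame(sample = rep(1:2, each = 5),
                 status = rep(c(0, 1), times = 5),
                 s1 = runif(10, -1, 1),
                 s2 = runif(10, -5, 5),
                 s3 = runif(10, -25, 25))

# Hypothetical helper: centre and scale x using only the elements where ref is TRUE
standardize_by <- function(x, ref) {
  (x - mean(x[ref])) / sd(x[ref])
}

# .names = "sd_{.col}" writes the results to sd_s1, sd_s2, sd_s3
# and leaves s1, s2, s3 untouched.
res <- df %>%
  group_by(sample) %>%
  mutate(across(s1:s3, ~ standardize_by(.x, status == 0), .names = "sd_{.col}")) %>%
  ungroup()

names(res)
```

Note that the funs(sd = ...) form in the comment above appends the suffix (s1_sd and so on), whereas the .names pattern used here puts sd_ in front, matching the column names asked for.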