
I have found a few very similar Stack Overflow questions, but the answers are not what I am looking for (Loop through columns and apply ddply, Aggregate / summarize multiple variables per group (i.e. sum, mean, etc)).

The main difference is that those answers simplify the problem by avoiding a for loop (or apply) and using aggregate (or similar) instead. However, I already have a large chunk of code working smoothly to do various summaries, stats, and plots, so what I would really like is to get a loop or function working. The issue I am facing is going from the column name stored as q in the loop to the actual column (get() is not working for me). See below.

My data set is similar to the one below, but with 40 features:

Subject <- c(rep(1, times = 6), rep(2, times = 6))
GroupOfInterest <- c(letters[rep(1:3, times = 4)])
Feature1 <- sample(1:20, 12, replace = T)
Feature2 <- sample(400:500, 12, replace = T)
Feature3 <- sample(1:5, 12, replace = T)
df.main <- data.frame(Subject,GroupOfInterest, Feature1, Feature2, 
Feature3, stringsAsFactors = FALSE)


My attempts so far have used a for loop:

Feat <- c(colnames(df.main[3:5]))    
for (q in Feat){
df_sum = ddply(df.main, ~GroupOfInterest + Subject,
            summarise, q =mean(get(q)))
  }

I hope this produces one row per combination of Subject and GroupOfInterest with the mean of each feature (although I realize that, as written, a separate merge step would be needed).

However, depending on how I do it, I either get an error ("Error in get(q) : invalid first argument") or it averages all values of a feature rather than grouping by Subject and GroupOfInterest.

I have also tried using lists and lapply but am running into similar difficulties.

From what I have gathered, my issue is that ddply expects the bare column name Feature1. When I loop, I provide it with either "Feature1" (a string) or the values themselves (1,14,14,16,17...), which are no longer part of the data frame and so cannot be grouped by Subject and GroupOfInterest.
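
To illustrate the distinction described above, here is a minimal base R sketch. It assumes a small made-up stand-in for df.main (the name `df` and its values are hypothetical) and shows that a column name held in a string can select the column with `[[`, and that `aggregate` can group by the key columns while the feature is chosen by name.

```r
# Made-up stand-in for df.main; the values are for illustration only
df <- data.frame(
  Subject = c(1, 1, 2, 2),
  GroupOfInterest = c("a", "a", "a", "a"),
  Feature1 = c(2, 4, 10, 20),
  stringsAsFactors = FALSE
)

q <- "Feature1"  # the loop variable holds a string, not the column itself

# df[[q]] looks the column up by name, so it is the same vector as df$Feature1
same_column <- identical(df[[q]], df$Feature1)

# Grouped mean with the feature chosen by name; df[q] keeps the column
# inside a data frame, so the grouping columns still line up with it
res <- aggregate(df[q], by = df[c("Subject", "GroupOfInterest")], FUN = mean)
print(res)
```

Because the feature stays attached to the data frame, the grouping by Subject and GroupOfInterest is preserved, which is the part that is lost when only the raw values are passed.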

Thanks so much for any help you can offer with solving this problem and teaching me how this process works.

Kirk Geier

3 Answers


Edited based on a comment: as.character(.) needs to be included.

Could you use summarise_at, together with the helper vars(contains(...))?

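library(dplyr)

# Mean of every column whose name contains "Feature", per Subject and GroupOfInterest
df.main %>% 
    group_by(Subject, GroupOfInterest) %>% 
    summarise_at(vars(contains("Feature")), funs(mean(as.numeric(as.character(.)))))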
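
For readers on a newer dplyr, where funs() is deprecated, the same idea can be sketched with across(). This is only a sketch under assumptions: it needs dplyr 1.0.0 or later, it uses a small made-up data frame (`df`, hypothetical) in place of df.main, and it assumes the feature columns are already numeric, so the as.character() conversion is left out.

```r
library(dplyr)

# Made-up stand-in for df.main; the values are for illustration only
df <- data.frame(
  Subject = c(1, 1, 2, 2),
  GroupOfInterest = c("a", "a", "b", "b"),
  Feature1 = c(2, 4, 10, 20),
  Feature2 = c(400, 410, 450, 470),
  stringsAsFactors = FALSE
)

# across() applies mean to every column whose name contains "Feature"
res <- df %>%
  group_by(Subject, GroupOfInterest) %>%
  summarise(across(contains("Feature"), mean), .groups = "drop")

print(res)
```

If the feature columns were factors, the conversion from the answer could be kept by passing `function(x) mean(as.numeric(as.character(x)))` instead of `mean`.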
CPak
  • `plyr` is an older package with a successor that is (imo) easier and more intuitive to use. @CPak's solution uses `dplyr`, which makes this problem very easy. – Jake Kaupp Jan 09 '18 at 19:28
  • https://stackoverflow.com/questions/10178203/sending-in-column-name-to-ddply-from-function, apparently this is difficult to do with `summarise` in `plyr`. – Jake Kaupp Jan 09 '18 at 19:34
  • Your solution doesn't work here; you need to convert to character before converting to numeric: `df.main %>% group_by(Subject, GroupOfInterest) %>% summarise_at(vars(contains("Feature")), funs(mean(as.numeric(as.character(.)))))` – denis Jan 09 '18 at 19:57

The dplyr solution is given above, but to be fair, here is the data.table one:

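library(data.table)
DT <- setDT(df.main)  # converts df.main to a data.table by reference
# Mean of every column whose name contains "Feature", per Subject and GroupOfInterest
DT[, lapply(.SD, function(x) mean(as.numeric(as.character(x)))),
   .SDcols = grep("Feature", names(DT), value = TRUE), by = .(Subject, GroupOfInterest)]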

   Subject GroupOfInterest Feature1 Feature2 Feature3
1:       1               a      6.5    459.5      2.0
2:       1               b     11.0    480.5      4.0
3:       1               c      9.5    453.0      4.5
4:       2               a      3.5    483.0      1.5
5:       2               b      8.0    449.0      3.5
6:       2               c     11.5    424.0      1.0
denis

The OP asked for a simple for loop for this transformation. I understand there are many more optimized ways to solve this, but to respect the OP's wish I have written a for-loop-based solution. I have used dplyr, as plyr is old now.

library(dplyr)
Subject <- c(rep(1, times = 6), rep(2, times = 6))
GroupOfInterest <- c(letters[rep(1:3, times = 4)])
Feature1 <- sample(1:20, 12, replace = T)
Feature2 <- sample(400:500, 12, replace = T)
Feature3 <- sample(1:5, 12, replace = T)
#small change in the way data.frame is created 
df.main <- data.frame(Subject,GroupOfInterest, Feature1, Feature2, 
 Feature3, stringsAsFactors = FALSE)

Feat <- c(colnames(df.main[3:5])) 

# Start with the key columns on which the grouping is done
resultdf <- unique(select(df.main, Subject, GroupOfInterest))
#> resultdf
#  Subject GroupOfInterest
#1       1               a
#2       1               b
#3       1               c
#7       2               a
#8       2               b
#9       2               c


# Loop over each feature column
for(q in Feat){
  summean <- paste0('mean(', q, ')')  # expression as a string, e.g. "mean(Feature1)"
  summ_name <- paste0(q) # name of the column that stores the mean
  df_sum <- df.main %>% 
    group_by(Subject, GroupOfInterest) %>%
    summarise_(.dots = setNames(summean, summ_name)) 
  # merge the new mean column into resultdf
  resultdf <- merge(resultdf, df_sum, by = c("Subject", "GroupOfInterest"))
}

# Final result
#> resultdf
#  Subject GroupOfInterest Feature1 Feature2 Feature3
#1       1               a      6.5    473.0      3.5
#2       1               b      4.5    437.0      2.0
#3       1               c     12.0    415.5      3.5
#4       2               a     10.0    437.5      3.0
#5       2               b      3.0    447.0      4.5
#6       2               c      6.0    462.0      2.5
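
The underscore verbs such as summarise_() were later deprecated in dplyr, so here is a hedged sketch of the same loop using the `.data` pronoun instead of pasting the expression into a string. It assumes dplyr 1.0.0 or later (for the `.groups` argument), and the data frame `df`, the vector `feat`, and their values are made up for illustration.

```r
library(dplyr)

# Made-up stand-in for df.main; the values are for illustration only
df <- data.frame(
  Subject = c(1, 1, 2, 2),
  GroupOfInterest = c("a", "a", "b", "b"),
  Feature1 = c(2, 4, 10, 20),
  Feature2 = c(400, 410, 450, 470),
  stringsAsFactors = FALSE
)

feat <- c("Feature1", "Feature2")

# Start from the unique key combinations, then add one mean column per feature
result <- distinct(df, Subject, GroupOfInterest)

for (q in feat) {
  df_mean <- df %>%
    group_by(Subject, GroupOfInterest) %>%
    # .data[[q]] looks up the column named by the string q;
    # !!q := names the output column after the same string
    summarise(!!q := mean(.data[[q]]), .groups = "drop")
  result <- merge(result, df_mean, by = c("Subject", "GroupOfInterest"))
}

print(result)
```

The loop structure and the merge step are the same as above; only the way the column name reaches summarise() differs.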
MKR
  • Thanks so much for your answer! Yes, I agree there are better ways to do it, but it was nice learning this way so I didn't have to alter my existing code too much. – Kirk Geier Jan 12 '18 at 19:26
  • I edited my post to include your way of making the df (stringsAsFactors = FALSE). For future readers, here is what I needed to change to make my code work: 1. summarise_ is the dplyr function I should have used instead. 2. .dots = is the way to tell dplyr you are feeding in new arguments. 3. Concatenating the function call into a new string variable is a way around the function not accepting your variable (q). *Feel free to correct or clarify. – Kirk Geier Jan 12 '18 at 19:34