1

Another beginner question for aggregation.

I want to aggregate multiple columns in a using values multiple columns. Yes I have seen some previous similar post. However I think the difference here is I'm trying to aggregate based on multiple columns.

For example my data frame:

column1 column2 column3 V1  V2
A       a        7      90  600
A       a        7      90  600
A       b        7      80  600
A       b        6      70  5000
A       b        6      70  5000
....

Aggregate and sum the numbers in V1 and V2:

column1 column2 column3 V1  V2
A       a        7      180 1200
A       b        7      80  600
A       b        6      140 10000
....

Here's my minimized data and code:

#generate minimal sample data
column1 <- c("S104259","S2914138" ,"S999706","S1041120",
         rep("S1042529",6), rep('S1235729',4))
column2 <- c("  T6-R190116","T2-R190213" ,"T8-R190118",
         rep("T8-R190118",3), rep('T2-R190118',3),rep('T6-R200118',4),'T1-R200118')
column3 <- c(rep("3S_DMSO",7),rep("uns_DMSO",5),rep("3s_DMSO",2))
output_1 <- c(664,292,1158,574,38,0,2850,18,74,8,10,0,664,30)
output_2 <- c(364,34,0,74,8,0,850,8,7,8,310,0,64,380)
df <-data.frame(column1,column2,column3,output_1,output_2)

#aggregate data by the same value in  column 1, 2 and 3

new_df <- aggregate(cbind(df$output_1,df$output_2), by=list(Category=df$column1), FUN=sum)

write.table(new_df, file = "aggregatedDMSO.txt",sep="\t", row.names = F, col.names = T)

So

  1. How can I pass column 1, 2 and 3 the same time into the list? I tried & them together and it didn't work.
  2. Second greedy question: my real dataset will have a lot columns of output, is there another way than cbind hard code all their names? (yes for some cases I can df[,all columns from a certain index], but other times I might need to omit a few columns)

Thank you, ML

Jaap
  • 81,064
  • 34
  • 182
  • 193
ML33M
  • 341
  • 2
  • 19
  • Nothing [here](https://stackoverflow.com/q/21644848/5325862) or in the 2 posts it's listed as duplicating help? – camille Feb 29 '20 at 23:13
  • @Camille ouch, yes these are helpful. Sorry my google gave me different post on stack overflow, with only one column of variable to concatenate... – ML33M Mar 01 '20 at 01:53
  • @Camille hi, I think it is similar, but not quite the same, given it's going for the mean. I would prefer the answer provided by Akrun, and his method in explaining :) – ML33M Mar 02 '20 at 00:25
  • 1
    Changing `FUN = mean` to `FUN = sum` is a pretty trivial difference, don't you think? – camille Mar 02 '20 at 02:13

2 Answers2

3

We can use the formula method:

aggregate(.~ column1 + column2 + column3,  df, sum)

Or if use the 'S3' method:

aggregate(df[c('output_1', 'output_2')], by = df[1:3], FUN = sum)

If we are using , then it becomes more easier

library(dplyr)
df %>%
    group_by_at(vars(starts_with('column'))) %>%
    summarise_at(vars(starts_with('output')), sum)

We can also pass multiple functions at the same time

df %>%
     group_by_at(vars(starts_with('column'))) %>%
     summarise_at(vars(starts_with('output')), list(sum = sum, mean = mean, sd = sd))
massisenergy
  • 1,764
  • 3
  • 14
  • 25
akrun
  • 874,273
  • 37
  • 540
  • 662
2

I always prefer using base packages and packages preinstalled with R. In terms of aggregation however I much prefer the ddply way because of its flexibility. You can aggregate with mean sum sd or whatever descriptive you choose.

column1<-c("S104259","S2914138","S999706","S1041120",rep("S1042529",6),rep('S1235729',4))
column2<-c("T6-R190116","T2-R190213","T8-R190118",rep("T8-R190118",3),rep('T2-R190118',3),rep('T6-R200118',4),'T1-R200118')
column3<-c(rep("3S_DMSO",7),rep("uns_DMSO",5),rep("3s_DMSO",2))
output_1<-c(664,292,1158,574,38,0,2850,18,74,8,10,0,664,30)
output_2<-c(364,34,0,74,8,0,850,8,7,8,310,0,64,380)
df<-data.frame(column1,column2,column3,output_1,output_2)

library(plyr)
factornames<-c("column1","column2","column3")
plyr::ddply(df,factornames,plyr::numcolwise(mean,na.rm=TRUE))
plyr::ddply(df,factornames,plyr::numcolwise(sum,na.rm=TRUE))
plyr::ddply(df,factornames,plyr::numcolwise(sd,na.rm=TRUE))
  • thank you :) hmmm, sorry it seems I can only vote for 1 answer. I tested both answers they all work. The one below has more educational impact on me as a rookie, I hope there's no hard feelings if I click the one below. – ML33M Mar 01 '20 at 01:52
  • We don't have hard feelings here, this is a nice community, we come here to have fun and do our job. I am using the cite more than 2 years and I have never encountered "hard feelings" if you encounter someone like that this has to somehow be reported to keep the community pleasant. – Dimitrios Zacharatos Mar 01 '20 at 09:14
  • thank you! Indeed I'm having lots fun here learning, and people are awesome and very helpful. I just want to also contribute to the positive vibe here, or at least don't do anybody wrong :) – ML33M Mar 01 '20 at 17:57
  • 1
    Just FYI `plyr` has been in retirement for a few years now in favor of `dplyr`/`purrr` – camille Mar 02 '20 at 02:12
  • @camille this is important to know In general I am against using libraries. The best thing someone can do is to write his own thing to do the above I think. This will not bloat his own pagkage with external libraries and dependencies – Dimitrios Zacharatos Mar 02 '20 at 11:38