1

I want to apply the aggregate function (or something with the same effect) to average say the value of Column_B , based on the groups in Column_A. However, in practice there will be a large amount of groups in A, and for smaller groups (say n < 30) I'm often not as interested in the result. Therefore I'd like to include the group size in a column next to it, and also possibly filter by group size.

So say I have an example table such as:

x = 
Column_A      Column_B
Person1       4  
Person1       6  
Person1       7  
Person2       8   
Person2       11  
Person2       10  
Person2       13  
Person2       15  
Person3       19   

My desired output would be

Column_A    Column_B Avg.  n
Person1     5.66           3
Person2     11.4           5
Person3     19             1

I know the first two columns can be achieved using:

aggregate(x[, 2], list(x$Column_A), mean)

Similarly, I could count the number of times each person appears with a separate command, but just wondering is there any easier way of combining these together? I thought maybe adding sum as another parameter after mean in the function, but couldn't find an example of that.

d.b
  • 32,245
  • 6
  • 36
  • 77
Eamonn
  • 13
  • 2
  • 1
    Try `library(dplyr); df1 %>% group_by(Column_A) %>% summarise(Colum_BAvg = mean(Column_B), n = n())` – akrun Jul 11 '19 at 20:44
  • 3
    `aggregate(Column_B ~ Column_A, x, function(x) c(mean = mean(x), n = length(x)))` This returns a two column data frame whose second column is named `Column_B` and is itself a 2 column matrix with columns `mean` and `n`. – G. Grothendieck Jul 11 '19 at 20:48

1 Answers1

2
do.call(rbind, lapply(split(df1$Column_B, df1$Column_A), function(x){
    data.frame(average = mean(x), n = length(x))
}))
#          average n
#Person1  5.666667 3
#Person2 11.400000 5
#Person3 19.000000 1

OR

foo = c("mean", "length", "min", "median", "sum")
grp = "Column_A"
val = c("Column_B")
Reduce(function(x, y) merge(x, y, by = grp, all = TRUE), 
       lapply(foo, function(f){
           setNames(aggregate(df1[val], df1[grp], f),
                    c(grp, paste(val, f, sep = ".")))
       }))
#  Column_A Column_B.mean Column_B.length Column_B.min Column_B.median Column_B.sum
#1  Person1      5.666667               3            4               6           17
#2  Person2     11.400000               5            8              11           57
#3  Person3     19.000000               1           19              19           19

Or a slightly modified version of G. Grothendieck's comment

df2 = aggregate(formula = Column_B~Column_A,
                data = df1,
                FUN = function(x) data.frame(mean = mean(x),
                                             n = length(x)),
                simplify = FALSE)
df2 = data.frame(df2$Column_A, do.call(rbind, df2$Column_B))
df2
#  df2.Column_A      mean n
#1      Person1  5.666667 3
#2      Person2 11.400000 5
#3      Person3 19.000000 1
d.b
  • 32,245
  • 6
  • 36
  • 77