Group-by in data.table with choosing first element in multiple columns

Question

Given the data table:

DT <- data.table(C1=1:6,C2=10:15,C3=1:3,C4=c("A","B")) 

   C1 C2 C3 C4
1:  1 10  1  A
2:  2 11  2  B
3:  3 12  3  A
4:  4 13  1  B
5:  5 14  2  A
6:  6 15  3  B

If I want to group by C4, compute the mean on C3 and display the first element in C1, I would do this:

DT[, list(C1=C1[1], AVG_C3=mean(C3), Freq=.N), by=C4]
   C4 C1 AVG_C3 Freq
1:  A  1      2    3
2:  B  2      2    3

If I would like to get the first element in C2, then I add C2[1] in the above line:

DT[, list(C1=C1[1], C2=C2[1], AVG_C3=mean(C3), Freq=.N), by=C4]

How would I tackle this task given the fact that I have hundreds of columns and I wish to see the first element as explained above?

Maybe there is some simple way like working with a vector of desired columns cols <- c("C1","C2"), but I can't figure it out how to assign to cols within DT[]. I had a look at assigning to multiple columns here Assign multiple columns using := in data.table, by group , but I'm not sure how to implement it for this case. I would appreciate your feedback :)

score 4 · Accepted Answer · answered Jun 08 '16 at 00:17

4

Suppose you want to see the first element of C1, C2 and C3 respectively, you can use the head on the .SD and specify the column names using .SDcols.

cols <- c("C1", "C2", "C3")
DT[, c(head(.SD, 1), list(AVG_C3=mean(C3), Freq=.N)), by=C4, .SDcols = cols]

   C4 C1 C2 C3 AVG_C3 Freq
1:  A  1 10  1      2    3
2:  B  2 11  2      2    3

answered Jun 08 '16 at 00:17

Psidom

209,562
33
339
356

3

Or even just `DT[, c(.SD[1], .(AVG_C3=mean(C3), Freq=.N)), by=C4, .SDcols = cols]` since `.SD` is a `data.table` itself, so `.SD[1]` grabs the first row. – thelatemail Jun 08 '16 at 01:11

akrun · Answer 2 · 2016-06-08T02:20:41.947

We can do the summary statistics on specified columns by 'C4' and join on the original dataset to get the corresponding rows.

DT[DT[, .(Avg_C3= mean(C3), freq= .N, C1 = C1[1L]) , by = C4], on = c("C4", "C1")]
#   C1 C2 C3 C4 Avg_C3 freq
#1:  1 10  1  A      2    3
#2:  2 11  2  B      2    3

Or an option using dplyr

library(dplyr)
DT %>% 
  group_by(C4) %>%
  mutate(Avg_C3 = mean(C3), freq = n(), C1 = first(C1)) %>% 
  slice(1L)
#      C1    C2    C3    C4 Avg_C3  freq
#   <int> <int> <int> <chr>  <dbl> <int>
#1     1    10     1     A      2     3
#2     2    11     2     B      2     3

Group-by in data.table with choosing first element in multiple columns

2 Answers2