2

I want to aggregate values of a data.table in R via multiple functions by a grouping variable AND keep the information that is in other columns (not included in the aggregation) in the corresponding rows (=the same row as the aggregation). An example:

Note: The code uses this which_quantile()-function (using sort(x) instead of order(x) in its code). It finds an actual value of a data set which is close to the defined quantile.

# sample data
dt <- structure(list(State = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
.Label = c("AK", "RI"), class = "factor"), Company = structure(1:8, .Label = c("A", 
"B", "C", "D", "E", "F", "G", "H"), class = "factor"), Employees = c(82L, 
104L, 37L, 24L, 19L, 118L, 88L, 42L), Number=c(1L,2L,3L,4L,5L,6L,7L,8L), Number2=c(9,10,11,12,13,14,15,16)),
.Names = c("State", "Company", "Employees", "Number", "Number2"), class = "data.frame", row.names = c(NA, 8L))

require(data.table)
setDT(dt)

# aggregation
agg <- dt[ , .(max = max(Employees),
               min = min(Employees),
               quantile70 = which.quantile(Employees, 0.7)), by=State]
agg_m <- dt[agg, on="State"]

aggregating a DT leads to the following output:

     State  max   min   quantile70
1:    AK    104   24    82
2:    RI    118   19    88

merging the aggregation with the original DT to that:

     State Company Employees Number Number2    max      min quantile70
1:    AK       A        82      1       9      104       24     82
2:    AK       B       104      2      10      104       24     82
3:    AK       C        37      3      11      104       24     82
4:    AK       D        24      4      12      104       24     82
5:    RI       E        19      5      13      118       19     88
6:    RI       F       118      6      14      118       19     88
7:    RI       G        88      7      15      118       19     88
8:    RI       H        42      8      16      118       19     88

Question: How can I aggregate the data.table and at the same time keep the corresponding values in the columns Company, Number and Number2? The max. of the Number column in the State AK is 104 and the corresponding value in column Number2 is 10. The min is 24 and the corresponding value 12 and so on. How to keep this information when aggregating a data.table?

desired output:

    State Company Employees Number Number2 aggregation
1:    AK       A        82      1       9      quantile70
2:    AK       B       104      2      10      max
3:    AK       D        24      4      12      min
4:    RI       E        19      5      13      min
5:    RI       F       118      6      14      max
6:    RI       H        88      8      16      quantile70

The question is similar to this one. The sample data was also taken from there and adapted.


The following aggregations don't solve my problem:

dt[ ,.SD[ which.max(Employees) ], by=State]
dt[dt[ ,.I[ which.max(Employees) ], by=State ]$V1]
# only which.max() OR which.min() are possible

dt[ , max_Empl := max(Employees), by=State ]
# only ONE aggregation function at a time is possible
Community
  • 1
  • 1
andschar
  • 3,504
  • 2
  • 27
  • 35
  • 1
    which.quantile is not a function in R... – Frank Nov 22 '16 at 19:20
  • that's why i've linked the site where I got it from.. can also be omitted – andschar Nov 22 '16 at 19:24
  • Ok, yeah, best to make the question more self-contained one way or another I think. I'd drop it if it is not essential to this question. – Frank Nov 22 '16 at 19:26
  • To assign multiple columns at once, you can do `dt[, \`:=\`(up = max(Employees), dn = min(Employees)), by=State]` or something like `dt[, (cols) := .(col1, col2, ...), by=State, .SDcols=cols]` – Frank Nov 22 '16 at 19:28
  • thanks, but then the data.table isn't aggregated (I.e only the max and min value per group (AND their corresponding values from other columns) are kept) – andschar Nov 22 '16 at 19:39
  • Hm, maybe you could show the output you're after for this example more explicitly. – Frank Nov 22 '16 at 19:42

1 Answers1

1

Following @eddi's canonical answer on subsetting by group...

aggi <- dt[ , .(max = .I[which.max(Employees)],
               min = .I[which.min(Employees)],
               quantile70 = .I[which.quantile(Employees, 0.7)]), by=State]

From here, you can do

maggi <- melt(aggi, id="State")

dt[maggi$value][, v := maggi$variable][]

   State Company Employees Number Number2          v
1:    AK       B       104      2      10        max
2:    RI       F       118      6      14        max
3:    AK       D        24      4      12        min
4:    RI       E        19      5      13        min
5:    AK       A        82      1       9 quantile70
6:    RI       G        88      7      15 quantile70
Community
  • 1
  • 1
Frank
  • 66,179
  • 8
  • 96
  • 180
  • Thank you for your answer. your first line helps me to calculate the min and max values and retain the additional information! However this only works with min and max. (You're right about the redundancies btw) I edited my question back, containing again the which.quantile(). this function can't be incorporated in your code AFAIK. Thanks for any further help – andschar Nov 24 '16 at 14:51
  • @andrasz Okay, I changed it. By the way, you can click the "edited x ago" to see the previous version again. A minor point, but I don't think "aggregation" is really the right term for this; it's just filtering to some quantiles, right? – Frank Nov 24 '16 at 15:49