Get top k records per group, where k differs by group, in R data.table

Question

I have two data.tables:

Values to extract the top k from, per group.
A mapping from group to the k values to select for that group.

The question "How to find the top N values by group or within category (groupwise) in an R data.frame" covers the case where k is the same for every group. How can I do this when k varies by group? Here's sample data and the desired result:

Values:

library(data.table)

(dt <- data.table(id=1:10,
                  group=c(rep(1, 5), rep(2, 5))))
#     id group
#  1:  1     1
#  2:  2     1
#  3:  3     1
#  4:  4     1
#  5:  5     1
#  6:  6     2
#  7:  7     2
#  8:  8     2
#  9:  9     2
# 10: 10     2

Mapping from group to k:

(group.k <- data.table(group=1:2, 
                       k=2:3))
#    group k
# 1:     1 2
# 2:     2 3

Desired result, which should include the first two records from group 1 and the first three records from group 2:

(result <- data.table(id=c(1:2, 6:8),
                      group=c(rep(1, 2), rep(2, 3))))
#    id group
# 1:  1     1
# 2:  2     1
# 3:  6     2
# 4:  7     2
# 5:  8     2

Applying the solution to the above-linked question after merging returns this error:

merged <- merge(dt, group.k, by="group")
(result <- merged[, head(.SD, k), by=group])
# Error: length(n) == 1L is not TRUE

related: https://stackoverflow.com/questions/56166410/show-top-bottom-k-in-each-group-using-data-table/56166489#56166489 — IVIM, May 16 '19 at 13:56

score 5 · Answer 1 · edited May 23 '17 at 12:26

I'd rather do it as:

dt[group.k, head(.SD, k), by=.EACHI, on="group"]

because it makes the intended operation clear. Of course, j could also be .SD[1:k]. Both expressions will very likely be optimised further for speed in the next release.

See this post for a detailed explanation of by=.EACHI until we wrap those vignettes.
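
For reference, here is a minimal self-contained sketch of this approach on the sample data from the question. The name res is only for illustration. As I understand it, inside j the symbol k is taken from group.k, so it is a single value for each group:

```r
library(data.table)

# Sample data from the question
dt <- data.table(id = 1:10, group = c(rep(1, 5), rep(2, 5)))
group.k <- data.table(group = 1:2, k = 2:3)

# Join dt to group.k on "group" and evaluate j once per row of group.k.
# With by = .EACHI, k is that row's single value, so head() gets a scalar n.
res <- dt[group.k, head(.SD, k), by = .EACHI, on = "group"]

res$id
# 1 2 6 7 8
```

One difference between the two forms of j: head(.SD, k) stops at the end of the group when k exceeds the group size, whereas .SD[1:k] should pad the result with NA rows in that case, so head() is probably the safer default.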

answered Nov 29 '15 at 22:43 by Arun

score 1 · Answer 2 · edited May 23 '17 at 11:47

After merging k in by group, an approach similar to the one in https://stackoverflow.com/a/14800271/1840471 can be applied; you just need unique() to avoid the length(n) error:

merged <- merge(dt, group.k, by="group")
(result <- merged[, head(.SD, unique(k)), by=group])
#    group id k
# 1:     1  1 2
# 2:     1  2 2
# 3:     2  6 3
# 4:     2  7 3
# 5:     2  8 3
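
The error in the question comes up because, after the merge, k inside j is a vector with one entry per row of the group, while head() expects a single n. A self-contained sketch of the fix, which also drops the helper column k so the output has the same columns as the desired result (the name res and the use of k[1L] instead of unique(k) are my choices for illustration; the two are equivalent here because k is constant within each group):

```r
library(data.table)

# Sample data from the question
dt <- data.table(id = 1:10, group = c(rep(1, 5), rep(2, 5)))
group.k <- data.table(group = 1:2, k = 2:3)

# Attach each group's k to every row of that group
merged <- merge(dt, group.k, by = "group")

# k is repeated within a group, so take its first element as the scalar n
res <- merged[, head(.SD, k[1L]), by = group]

# Remove the helper column by reference
res[, k := NULL]

names(res)
# "group" "id"
```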

answered Nov 29 '15 at 22:19 by Max Ghenis

Could also be written as `dt[group.k, on = "group"][, .SD[1:k[1L]], by = group]`, since within each group `k[1L]` will always be the same as `unique(k)` – Rich Scriven Nov 29 '15 at 22:35
