I have two data.tables:

  1. Values to extract the top k from, per group.
  2. A mapping from group to the k values to select for that group.

The question "How to find the top N values by group or within category (groupwise) in an R data.frame" covers the case where k does not vary by group. How can I do this when k varies by group? Here's sample data and the desired result:

Values:

library(data.table)

(dt <- data.table(id=1:10,
                  group=c(rep(1, 5), rep(2, 5))))
#     id group
#  1:  1     1
#  2:  2     1
#  3:  3     1
#  4:  4     1
#  5:  5     1
#  6:  6     2
#  7:  7     2
#  8:  8     2
#  9:  9     2
# 10: 10     2

Mapping from group to k:

(group.k <- data.table(group=1:2, 
                       k=2:3))
#    group k
# 1:     1 2
# 2:     2 3

Desired result, which should include the first two records from group 1 and the first three records from group 2:

(result <- data.table(id=c(1:2, 6:8),
                      group=c(rep(1, 2), rep(2, 3))))
#    id group
# 1:  1     1
# 2:  2     1
# 3:  6     2
# 4:  7     2
# 5:  8     2

Applying the solution to the above-linked question after merging returns this error:

merged <- merge(dt, group.k, by="group")
(result <- merged[, head(.SD, k), by=group])
# Error: length(n) == 1L is not TRUE
Max Ghenis
  • related: https://stackoverflow.com/questions/56166410/show-top-bottom-k-in-each-group-using-data-table/56166489#56166489 – IVIM May 16 '19 at 13:56

2 Answers

I'd rather do it as:

dt[group.k, head(.SD, k), by=.EACHI, on="group"]

because the intended operation is easy to see. j can also be .SD[1:k], of course. Both of these expressions will very likely be further optimised for speed in the next release.

See this post for a detailed explanation of by=.EACHI until we wrap up those vignettes.
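
As a self-contained sketch of the join above, using the question's sample data (the variable name res is only illustrative and is not part of the original answer):

```r
library(data.table)

dt <- data.table(id = 1:10, group = c(rep(1, 5), rep(2, 5)))
group.k <- data.table(group = 1:2, k = 2:3)

# For each row of group.k, find the matching rows of dt on "group" and
# evaluate j once per group.k row (by = .EACHI). Inside j, k is the single
# value from that group.k row, so head() receives a length-one n.
res <- dt[group.k, head(.SD, k), by = .EACHI, on = "group"]

res$id  # ids of the first 2 rows of group 1 and the first 3 rows of group 2
```

Here k is taken from group.k because dt has no column of that name; writing i.k instead should make that lookup explicit.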

Arun

After merging in k by group, an approach similar to the solution at https://stackoverflow.com/a/14800271/1840471 can be applied. You just need unique() to avoid the length(n) error:

merged <- merge(dt, group.k, by="group")
(result <- merged[, head(.SD, unique(k)), by=group])
#    group id k
# 1:     1  1 2
# 2:     1  2 2
# 3:     2  6 3
# 4:     2  7 3
# 5:     2  8 3
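
The length(n) error appears to come from k being a column of merged: within each group it has one element per row, while head() expects a single number. A minimal sketch of that, on the same sample data, which also uses .SDcols to keep the helper column k out of the result (the k_length column name is only illustrative):

```r
library(data.table)

dt <- data.table(id = 1:10, group = c(rep(1, 5), rep(2, 5)))
group.k <- data.table(group = 1:2, k = 2:3)
merged <- merge(dt, group.k, by = "group")

# Within a group, k is repeated once per row, so it is not a scalar.
k_lengths <- merged[, .(k_length = length(k)), by = group]

# Reduce k to a single value per group before passing it to head();
# .SDcols = "id" restricts .SD (and so the output) to the id column.
result <- merged[, head(.SD, unique(k)), by = group, .SDcols = "id"]
result
```

This relies on k being constant within each group, which holds here because it comes from a one-row-per-group mapping.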
Max Ghenis
  • Could also be written as `dt[group.k, on = "group"][, .SD[1:k[1L]], by = group]` since `k[1L]` will always be the same as `unique(k)` – Rich Scriven Nov 29 '15 at 22:35
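
One difference between head(.SD, k) and .SD[1:k] that may matter: when k exceeds the number of rows in a group, the two do not return the same thing. A small sketch with hypothetical data (the table small and both result names are made up for illustration):

```r
library(data.table)

# Hypothetical data: group 1 has only 2 rows, but k asks for 3.
small <- data.table(id = 1:2, group = 1L, k = 3L)

with_head  <- small[, head(.SD, k[1L]), by = group, .SDcols = "id"]
with_index <- small[, .SD[1:k[1L]], by = group, .SDcols = "id"]

nrow(with_head)   # head() stops at the end of the group
nrow(with_index)  # row indexing pads the missing row with NA
```

If every k is known to be at most the group size, as in the question's sample data, the two forms give the same rows.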