Similar questions were asked and answered e.g. here:
- Select the top N values by group
- Get top k records per group, where k differs by group, in R data.table
Now my aim extends the there asked questions by wanting to not not select but modify a column by for the N highest values within a group that fulfil one or multiple conditions. Important hereby is that I would like to modify the column by reference ( := ) as my data.table has several hundred million rows.
So far I could come up with this solution:
library(data.table)
# Create a sample data.table
dt <- data.table(x = rep(1:10, 2), y = rep(c(1:4,rep(5,5),6), 2), group = rep(c(1, 2), each = 10))
# So y is sorted
setorder(dt, group, -y)
# Is x < 8
dt[, cond := x <= 8]
# Is y within the 3 biggets y where x < 8
dt[cond == TRUE, cond_max := y %in% head(y, 3), by = group]
# "Clean" result column
dt[cond_max == TRUE, result := TRUE]
dt[, c("cond", "cond_max") := NULL]
dt
One problem is that because of duplicated values in y and as i use %in% 4 rows per group are set to be the highest 3. In case of duplicates I would like to get only the first 3 rows (similar to which.max()). Further it requires temporary columns for each condition.
As a kind of NOT working pseudo code idea:
dt <- data.table(x = rep(1:10, 2), y = rep(c(1:4,rep(5,5),6), 2), group = rep(c(1, 2), each = 10))
setorder(dt, group, -y)
#This does NOT work
dt[x <= 8][, head(.SD, 3), by=group][, result := TRUE]