Using data.table modify a column for N highest values by group that fulfil one or multiple conditions

Question

Similar questions were asked and answered e.g. here:

Now my aim extends the there asked questions by wanting to not not select but modify a column by for the N highest values within a group that fulfil one or multiple conditions. Important hereby is that I would like to modify the column by reference ( := ) as my data.table has several hundred million rows.

So far I could come up with this solution:


library(data.table)

# Create a sample data.table
dt <- data.table(x = rep(1:10, 2), y = rep(c(1:4,rep(5,5),6), 2), group = rep(c(1, 2), each = 10))

# So y is sorted
setorder(dt, group, -y)

# Is x < 8
dt[, cond := x <= 8]

# Is y within the 3 biggets y where x < 8
dt[cond == TRUE, cond_max := y %in% head(y, 3), by = group]

# "Clean" result column
dt[cond_max == TRUE, result := TRUE]
dt[, c("cond", "cond_max") := NULL]

dt

One problem is that because of duplicated values in y and as i use %in% 4 rows per group are set to be the highest 3. In case of duplicates I would like to get only the first 3 rows (similar to which.max()). Further it requires temporary columns for each condition.

As a kind of NOT working pseudo code idea:

dt <- data.table(x = rep(1:10, 2), y = rep(c(1:4,rep(5,5),6), 2), group = rep(c(1, 2), each = 10))

setorder(dt, group, -y)

#This does NOT work
dt[x <= 8][, head(.SD, 3), by=group][, result := TRUE]

Your last point about *"no more than N values if duplicates exist"* might be remedied by `head(unique(y), 3)`, but that changes the results and expectations significantly. If my answer is not quite enough, please expand on that part. — r2evans, Apr 07 '23 at 13:44
Hi, thanks for your comment, i have updated by toy example to hopefully illustrate better what I mean with duplicate values. — Markus, Apr 07 '23 at 14:02

r2evans · Answer 1 · 2023-04-07T14:21:50.630

(Updated for different OP data.)

This produces the same result in one step. Using your code up through result:

dt[, result2 := (x <= 8) & y %in% head(y[x <= 8], 3), by = .(group)]
dt
#         x     y group result result2
#     <int> <num> <num> <lgcl>  <lgcl>
#  1:    10     6     1     NA   FALSE
#  2:     5     5     1   TRUE    TRUE
#  3:     6     5     1   TRUE    TRUE
#  4:     7     5     1   TRUE    TRUE
#  5:     8     5     1   TRUE    TRUE
#  6:     9     5     1     NA   FALSE
#  7:     4     4     1     NA   FALSE
#  8:     3     3     1     NA   FALSE
#  9:     2     2     1     NA   FALSE
# 10:     1     1     1     NA   FALSE
# 11:    10     6     2     NA   FALSE
# 12:     5     5     2   TRUE    TRUE
# 13:     6     5     2   TRUE    TRUE
# 14:     7     5     2   TRUE    TRUE
# 15:     8     5     2   TRUE    TRUE
# 16:     9     5     2     NA   FALSE
# 17:     4     4     2     NA   FALSE
# 18:     3     3     2     NA   FALSE
# 19:     2     2     2     NA   FALSE
# 20:     1     1     2     NA   FALSE
#         x     y group result result2

If you really need the FALSE elements to be NA, then we can add on a simple reassignment:

dt[, result2 := (x <= 8) & y %in% head(y[x <= 8], 3), by = .(group)][(!result2), result2 := NA]

Hi, Thanks for your idea. I have updated my toy example to show that you suggestion leads to problems if there is a duplicated y, for which the condition for x is once fulfilled and once not fulfilled. F`thanks for pointing out the fcase overuse! That was a left over from trying to get a good toy example. — Markus, Apr 07 '23 at 14:04

score 0 · Answer 2 · answered Apr 07 '23 at 18:23

Based on the excellent answer by @r2evans, I came up with this solution, which can handle duplicate y values by selecting always only the first N entries at the cost of creating one temporary integer column. Any more elegant/efficient solution is appreciated.

# Create a sample data.table
dt <- data.table(x = rep(1:10, 2), y = rep(c(1:4, rep(5, 5), 6), 2), group = rep(c(1, 2), each = 10))

# So y is sorted
setorder(dt, group, -y)

# Uniqie ID per row
dt[, row_id := 1L:.N]

# Top 3 rows of each group where x fulfills the condition
dt[, result := row_id %in% head(row_id[x <= 8], 3), by = .(group)]
dt[, row_id:= NULL]
dt
#     x y group result
#  1: 10 6     1  FALSE
#  2:  5 5     1   TRUE
#  3:  6 5     1   TRUE
#  4:  7 5     1   TRUE
#  5:  8 5     1  FALSE
#  6:  9 5     1  FALSE
#  7:  4 4     1  FALSE
#  8:  3 3     1  FALSE
#  9:  2 2     1  FALSE
# 10:  1 1     1  FALSE
# 11: 10 6     2  FALSE
# 12:  5 5     2   TRUE
# 13:  6 5     2   TRUE
# 14:  7 5     2   TRUE
# 15:  8 5     2  FALSE
# 16:  9 5     2  FALSE
# 17:  4 4     2  FALSE
# 18:  3 3     2  FALSE
# 19:  2 2     2  FALSE
# 20:  1 1     2  FALSE

Using data.table modify a column for N highest values by group that fulfil one or multiple conditions

2 Answers2