Calculations based on a dynamic subgroup of a data.table

Question

My question is related to "Subset by group with data.table" but different.

Imagine a data set like this:

tmp <- data.table(x = 1:10, y = c(27, 70, 54, 18, 50, 44, 22, 73, 6, 5))

For each row of the data, I want to calculate a new value, z, which is min(y) over all rows with a larger value of x. For instance, for the third row of the data, where x is 3, I want min(y) among rows with x > 3 (which would be the value 5). For all intents and purposes, you can assume the data is already ordered by x.

At first I thought of using a function like this:

min.y <- function(val, dt) {
  dt[x > val, min(y)]
}

But calling tmp[, z := min.y(x, tmp)] results in a warning message:

In min(y) : no non-missing arguments to min; returning Inf

What is the proper way to do this?

PS: Obviously, for the last row I expect to get NA as the result.

Mike H. · Answer 1 · 2018-12-08T06:28:26.347

Method 1:

Since you said we can assume the data is sorted on x, you could use a cumulative minimum starting from the end of y. We drop the first observation so that each row only sees the rows after it, which gives a > search rather than >=:

tmp$min_y <- c(rev(cummin(rev(tmp$y[-1]))), NA)

Update: The old method effectively did a >= search rather than >. Updated to do >.
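As a sanity check, here is a minimal base R sketch that compares the reverse cumulative minimum with a brute-force version of the same rule. The names fast and slow are only illustrative, and it assumes the x values are sorted and distinct, as in the question:

```r
# Example data from the question (already sorted by x)
x <- 1:10
y <- c(27, 70, 54, 18, 50, 44, 22, 73, 6, 5)

# Reverse cumulative minimum over y[-1], padded with NA for the last row
fast <- c(rev(cummin(rev(y[-1]))), NA)

# Brute-force reference: min of y over rows with a strictly larger x
slow <- vapply(x, function(v) {
  later <- y[x > v]
  if (length(later) > 0) min(later) else NA_real_
}, numeric(1))

# Both approaches should agree on every row, including the trailing NA
stopifnot(isTRUE(all.equal(fast, slow)))
```

The brute-force version is quadratic in the number of rows, so it is only useful as a check; the cumulative minimum does a single pass.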

Method 2: Data.table

If you wanted to use data.table, you could group by each row and then subset within j. The if/else guard is needed so that on the last row we don't take the min of no values:

tmp[, min_y := {
        curr_x <- x                      # x value of the current row
        tmp_subs <- tmp[x > curr_x]      # rows with a strictly larger x
        if (nrow(tmp_subs) > 0) min(tmp_subs[["y"]]) else NA_real_
    },
    by = 1:nrow(tmp)]

tmp
#     x  y min_y
# 1:  1 27     5
# 2:  2 70     5
# 3:  3 54     5
# 4:  4 18     5
# 5:  5 50     5
# 6:  6 44     5
# 7:  7 22     5
# 8:  8 73     5
# 9:  9  6     5
#10: 10  5     NA

Since 5 is the smallest value and sits at the end, every row gets 5. Let's make this a little more interesting:

tmp <- data.table(x = 1:10, y = c(27, 70, 54, 18, 50, 44, 22, 73, 47, 58))

Our result will be:

#     x  y min_y
# 1:  1 27    18
# 2:  2 70    18
# 3:  3 54    18
# 4:  4 18    22
# 5:  5 50    22
# 6:  6 44    22
# 7:  7 22    47
# 8:  8 73    47
# 9:  9 47    58
#10: 10 58    NA
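To see why the by-row grouping matters, here is a small sketch, assuming the data.table package is installed. The names n_ungrouped, later and res are only illustrative. Without grouping, the value passed in is the whole x column, so the filter compares x with itself element-wise and matches nothing, which is where the "returning Inf" warning in the question comes from:

```r
library(data.table)

tmp <- data.table(x = 1:10, y = c(27, 70, 54, 18, 50, 44, 22, 73, 47, 58))

# Ungrouped: the right-hand side is the full column, so the comparison is
# element-wise x > x, which is never TRUE and selects no rows.
n_ungrouped <- nrow(tmp[x > tmp$x])

# Grouped by row: inside j, x is a single value, so the subset is meaningful.
res <- tmp[, {
  curr_x <- x
  later <- tmp[x > curr_x, y]
  .(min_y = if (length(later) > 0) min(later) else NA_real_)
}, by = .(row = seq_len(nrow(tmp)))]

print(n_ungrouped)
print(res)
```

Here res is built as a separate table rather than assigned with := so that tmp itself stays untouched.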

score 3 · Answer 2 · answered Dec 06 '18 at 20:09

One option would be a non-equi self-join:

tmp[, min_y := .SD[.SD, min(y, na.rm = TRUE), on = .(x > x),
         by = .EACHI]$V1][is.infinite(min_y), min_y := NA_real_][]
#     x  y min_y
# 1:  1 27     5
# 2:  2 70     5
# 3:  3 54     5
# 4:  4 18     5
# 5:  5 50     5
# 6:  6 44     5
# 7:  7 22     5
# 8:  8 73     5
# 9:  9  6     5
#10: 10  5    NA

score 2 · Answer 3 · answered Dec 06 '18 at 20:12

For what it's worth, one more possible approach (not sure if it's any better or worse than the others):

tmp[, z := min(tmp$y[(.I+1):NROW(tmp)]), by = 1:NROW(tmp)]
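One detail worth knowing about the index range above: on the last row, (.I+1):NROW(tmp) counts down from 11 to 10 instead of being empty. Indexing past the end of y yields NA, so min() returns NA for that row, which happens to be the desired result. A base R sketch of this behaviour, plus a variant with an explicit guard (min_after is a hypothetical helper name, not part of the answer):

```r
y <- c(27, 70, 54, 18, 50, 44, 22, 73, 6, 5)
n <- length(y)

# On the last row the range runs backwards rather than being empty
idx <- (n + 1):n
print(idx)          # the two indices n + 1 and n
print(y[idx])       # the out-of-range index contributes an NA
print(min(y[idx]))  # NA propagates through min()

# Variant with an explicit guard instead of relying on NA propagation
min_after <- function(i, y) {
  n <- length(y)
  if (i < n) min(y[(i + 1):n]) else NA_real_
}
z <- vapply(seq_along(y), min_after, numeric(1), y = y)
print(z)
```

The guarded form gives the same values but would keep working if na.rm = TRUE were later added to min().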

answered Dec 06 '18 at 20:12

dww


G. Grothendieck · Answer 4 · 2018-12-08T12:27:49.303

Here are some solutions:

1) rollapply. Assuming tmp is sorted (sort it if not), we can use rollapply to get a compact solution like this. Note that when the width argument of rollapply is a list, its elements are regarded as vectors of offsets over which to apply min.

library(data.table)
library(zoo)

tmp[, min := rollapply(y, lapply(pmax(.N:1-1, 1), seq), min, fill = NA)]

giving:

     x  y min
 1:  1 27   5
 2:  2 70   5
 3:  3 54   5
 4:  4 18   5
 5:  5 50   5
 6:  6 44   5
 7:  7 22   5
 8:  8 73   5
 9:  9  6   5
10: 10  5  NA

2) sqldf. Using SQL, join tmp to itself on the indicated condition and take the min over each group, like this. tmp need not be sorted.

library(data.table)
library(sqldf)

sqldf("select a.*, min(b.y) min 
       from tmp a left join tmp b on b.x > a.x group by a.rowid")

giving:

    x  y min
1   1 27   5
2   2 70   5
3   3 54   5
4   4 18   5
5   5 50   5
6   6 44   5
7   7 22   5
8   8 73   5
9   9  6   5
10 10  5  NA
