
My question is very similar to this one: How to extract the first n rows per group?

dt
         date age     name       val
1: 2000-01-01   3   Andrew  93.73546
2: 2000-01-01   4      Ben 101.83643
3: 2000-01-01   5  Charlie  91.64371
4: 2000-01-02   6     Adam 115.95281
5: 2000-01-02   7      Bob 103.29508
6: 2000-01-02   8 Campbell  91.79532

We have a data.table dt, to which I've added an extra column named val. First, we want to extract the first n rows within each group. The solutions from the linked question are:

dt[, .SD[1:2], by=date] # where 1:2 is the index needed
dt[dt[, .I[1:2], by = date]$V1] # for speed
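As a quick check, both forms select the same rows on the data above (a minimal sketch; set.seed(1) is used because it reproduces the val values shown in the question):

```r
library(data.table)

set.seed(1)  # reproduces the val values shown in the question
dt <- data.table(
  date = rep(c("2000-01-01", "2000-01-02"), each = 3),
  age  = c(3, 4, 5, 6, 7, 8),
  name = c("Andrew", "Ben", "Charlie", "Adam", "Bob", "Campbell"),
  val  = rnorm(6, 100, 10)
)

a <- dt[, .SD[1:2], by = date]        # .SD-based subset
b <- dt[dt[, .I[1:2], by = date]$V1]  # row-index ("for speed") subset
all.equal(a, b)
```

The row-index form avoids materializing .SD per group, which is why it is recommended for speed.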

My question is: how do I apply a function to the first n rows within each group, if that function depends on the subsetted information? I am trying to apply something like this:

# uses other columns for results / is dependent on the subsetted rows,
# but kept simple for replication
do_something <- function(dt){
  res <- ifelse(cumsum(dt$val) > 200, 1, 0)
  return(res)
}
# first 2 rows of dt by group=date
x <- dt[, .SD[1:2], by=date]
# apply do_something to first 2 rows of dt by group=date
x[, list('age'=age,'name'=name,'val'=val, 'funcVal'= do_something(.SD[1:2])),by=date]

          date age   name       val funcVal
1: 2000-01-01   3 Andrew  93.73546       0
2: 2000-01-01   4    Ben 101.83643       1
3: 2000-01-02   6   Adam 115.95281       0
4: 2000-01-02   7    Bob 103.29508       1

Am I going about this wrong? Is there a more efficient way to do this? I cannot seem to figure out how to apply the "for speed" solution here. Is there a way to do this without saving the subsetted result first, i.e. applying the function to the first 2 rows by date right away?

Any help is appreciated and below is the code to produce the data above:

library(data.table)
set.seed(1)  # reproduces the values shown above
date <- c("2000-01-01","2000-01-01","2000-01-01",
          "2000-01-02","2000-01-02","2000-01-02")
age <- c(3,4,5,6,7,8)
name <- c("Andrew","Ben","Charlie","Adam","Bob","Campbell")
val <- rnorm(6,100,10)
dt <- data.table(date, age, name, val)
Frank
road_to_quantdom

2 Answers


In case there's more than one grouping column, it might be more efficient to collapse to one:

m = dt[, .(g = .GRP, r = .I[1:2]), by = date]
dt[m$r, v := ff(.SD), by=m$g, .SDcols="val"]

This is just an extension of @eddi's approach (keeping row numbers, .I, as seen in @akrun's answer) to also keep the group counter, .GRP.
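A runnable sketch of the two lines above on the question's data (ff here is the stand-in function defined below; set.seed(1) reproduces the val values shown in the question):

```r
library(data.table)

set.seed(1)  # reproduces the val values shown in the question
dt <- data.table(
  date = rep(c("2000-01-01", "2000-01-02"), each = 3),
  age  = c(3, 4, 5, 6, 7, 8),
  name = c("Andrew", "Ben", "Charlie", "Adam", "Bob", "Campbell"),
  val  = rnorm(6, 100, 10)
)
# stand-in for the OP's function: flags rows once the running sum passes 200
ff <- function(x) as.integer(cumsum(x[[1]]) > 200)

m <- dt[, .(g = .GRP, r = .I[1:2]), by = date]   # group counter + row numbers
dt[m$r, v := ff(.SD), by = m$g, .SDcols = "val"]
dt
```

Rows outside the first two per group are left NA, since := only touches the rows selected in i.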


Regarding the OP's comment that they're more concerned about the function: borrowing from @akrun, there's ...

ff = function(x) as.integer(cumsum(x[[1]]) > 200)

Assuming all values are nonnegative, you could probably handle this in C more efficiently, since the cumulative sum can stop as soon as the threshold is reached. For the special case of two rows, that will hardly matter, though.
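For illustration only, here is a hypothetical plain-R sketch of that early-exit idea (a C implementation would follow the same shape): once the running sum of nonnegative values crosses the threshold, every later flag must be 1, so the loop can stop accumulating.

```r
# hypothetical early-stopping variant of ff, for nonnegative input only
ff_early <- function(x, threshold = 200) {
  out <- integer(length(x))
  s <- 0
  for (i in seq_along(x)) {
    s <- s + x[i]
    if (s > threshold) {
      out[i:length(x)] <- 1L  # all remaining flags are 1; stop summing
      break
    }
  }
  out
}

ff_early(c(150, 60, 10, 5))
```

For nonnegative input this matches as.integer(cumsum(x) > threshold), but it stops accumulating at the first crossing.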

My impression is that this is a dummy function so there's no point going there. Many efficiency improvements that I usually think of are contingent on the function and data.

Frank
  • My next problem was going to be grouping by more than one column so thanks for the pre-emptive solution! Is it worth trying to give a better example of the function I'm implementing and editing my question? – road_to_quantdom Oct 22 '18 at 15:57
  • Cool, glad it's helpful. I guess a new question might make more sense, since there are answers already re the current question and editing would make them incomplete – Frank Oct 22 '18 at 16:01

We can use as.integer on the cumsum to coerce the logical to binary: extract the row index, specify it as i, group by 'date', and apply the function to the 'val' column.

f1 <- function(x) as.integer(cumsum(x) > 200)
i1 <- dt[, .I[1:2], by = date]$V1
dt[i1, newcol := f1(val), date]
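For completeness, a runnable version against the question's data (a sketch; set.seed(1) reproduces the val values shown in the question). Rows outside the first two per group are left NA:

```r
library(data.table)

set.seed(1)  # reproduces the val values shown in the question
dt <- data.table(
  date = rep(c("2000-01-01", "2000-01-02"), each = 3),
  age  = c(3, 4, 5, 6, 7, 8),
  name = c("Andrew", "Ben", "Charlie", "Adam", "Bob", "Campbell"),
  val  = rnorm(6, 100, 10)
)

f1 <- function(x) as.integer(cumsum(x) > 200)
i1 <- dt[, .I[1:2], by = date]$V1   # row numbers of the first 2 rows per date
dt[i1, newcol := f1(val), date]
dt[]
```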
akrun
  • Very helpful. But my concern is with the simplification of `f1`. The function I'm actually using takes a `data.table` and returns a vector. How would you adjust your code above so it's not `f1(val)` but a function of the subsetted data.table? – road_to_quantdom Oct 22 '18 at 15:43