How to select the first n rows of each group in specified columns (after a join)?

Question

I want to combine the functionality of .SD with by = in a non-equi join:

data.table - select first n rows within group

Example data:

tmp_dt1<- data.table(grp = c(1,2), time = c(0.2, 0.6, 0.4, 0.8, 0.25, 0.65))
tmp_dt2 <- data.table(grp = c(1,2), time_from = c(0.1, 0.5))
tmp_dt2 <- tmp_dt2[, time_to := time_from + 0.2]

> tmp_dt1
   grp time
1:   1 0.20
2:   2 0.60
3:   1 0.40
4:   2 0.80
5:   1 0.25
6:   2 0.65
> tmp_dt2
   grp time_from time_to
1:   1       0.1     0.3
2:   2       0.5     0.7

Now, my desired output is the first time in each group that lies between the ranges defined in tmp_dt2. I can get all such times by:

> tmp_dt1[tmp_dt2, .(grp, time = x.time, time_from, time_to), on = .(grp, time >= time_from, time <= time_to)]
   grp time time_from time_to
1:   1 0.20       0.1     0.3
2:   1 0.25       0.1     0.3
3:   2 0.60       0.5     0.7
4:   2 0.65       0.5     0.7

However, I am having some trouble extracting the first n rows from each grp using by, without chaining. As an example, when n = 1, the desired output is:

tmp_dt1[tmp_dt2, .(grp, time = x.time, time_from, time_to), 
        on = .(grp, time >= time_from, time <= time_to)][, .SD[1], by = grp]

       grp time time_from time_to
1:   1  0.2       0.1     0.3
2:   2  0.6       0.5     0.7

but, something like:

> tmp_dt1[tmp_dt2, .(time = x.time[1], time_from[1], time_to[1]), on = .(grp, time >= time_from, time <= time_to), by = grp]
Error in `[.data.table`(tmp_dt1, tmp_dt2, .(time = x.time[1], time_from[1],  : 
  object 'time_from' not found

does not work.

Using, .SD comes close, but gives me a confusing end of result in terms of the columns selected:

tmp_dt1[tmp_dt2, .SD[1], on = .(grp, time >= time_from, time <= time_to), by = grp]
   grp time
1:   1  0.2
2:   2  0.6

The reason why I do not want to do it in a chain is because of memory issues. Please note, I am only interested in solving this particular problem with the data.table package.

score 2 · Answer 1 · answered Jun 20 '17 at 05:25

2

One option is to specify mult= first

tmp_dt1[tmp_dt2, .(grp, time = x.time, time_from, time_to), mult = "first", 
             on = .(grp, time >= time_from, time <= time_to)]
#    grp time time_from time_to
#1:   1  0.2       0.1     0.3
#2:   2  0.6       0.5     0.7

answered Jun 20 '17 at 05:25

akrun

874,273
37
540
662

thanks, what about "first n"? (made edit to question to remove emphasis on the first row) – Alex Jun 20 '17 at 05:26
@Alex `mult` has only options for `first`, `last` and `all`. You can check [here](https://cran.r-project.org/web/packages/data.table/vignettes/datatable-keys-fast-subset.html) For a general case,you already have the answer – akrun Jun 20 '17 at 05:31
thanks, so there is no way to avoid the creation of the first big data table and then subsetting it? – Alex Jun 20 '17 at 05:33

score 2 · Accepted Answer · answered Jun 20 '17 at 21:55

2

Have you tried

tmp_dt1[tmp_dt2, on=.(grp, time>=time_from, time<=time_to), 
    x.time, by=.EACHI] # or head(x.time, 2L) to get first 2 rows etc.

?

You'll need to rename the duplicate columns by yourself until that's taken care of internally, as explained here.

answered Jun 20 '17 at 21:55

Arun

116,683
26
284
387

1

thank you for this answer, and also the very helpful link that explains the `x.` notation – Alex Jun 20 '17 at 23:48

score 1 · Answer 3 · answered Jun 20 '17 at 17:31

If you want to minimize the memory usage another solution may be more memory efficient than the original chaining approach even though it looks strange to store a temporary result in a variable (but it contains only two columns and only the first n rows per group) and still use chaining (but on a smaller subset of the original data):

n = 1       # parameter: first "n" rows per group
selected.rows <- tmp_dt1[tmp_dt2, .(rownum = .I[1:n]), on = .(grp, time >= time_from, time <= time_to), by = grp]
tmp_dt1[selected.rows$rownum][tmp_dt2, .(grp, time = x.time, time_from, time_to), on = .(grp, time >= time_from, time <= time_to)]

Not very elegant and maybe slower (it duplicates the join logic and requires to join twice - even though on a reduced sub set in the second case)...

The temporary result set contains the row number of each "match" in the original data table (using the .I symbol of data.table):

selected.rows

   grp rownum
1:   1      1
2:   2      2

It would be great to compare this solution with chaining using a real big data table... (if I have more time I will profile this)

How to select the first n rows of each group in specified columns (after a join)?

3 Answers3