Only get data.table groups with a given number of rows

Question

I have the following data.table:

dt = data.table(year=c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2),
                quar=c(1, 1, 1, 2, 2, 3, 4, 4, 4, 1, 1, 1),
                item=c(1, 2, 3, 1, 2, 1, 1, 2, 3, 1, 2, 3))

Some periods (defined by year and quarter) have three items:

Y1Q1, Y1Q4, Y2Q1

The other periods don't:

Y1Q2 has 2 items
Y1Q3 has 1 item

How can I only get the rows which have all three items?

What I've managed so far is

dt[dt[, "i" := nrow(.SD) == 3, .(year, quar)]$i]
> year | quar | item | i
> -----+------+------+-----
> 1    | 1    | 1    | TRUE
> 1    | 1    | 2    | TRUE
> 1    | 1    | 3    | TRUE
> 1    | 4    | 1    | TRUE
> 1    | 4    | 2    | TRUE
> 1    | 4    | 3    | TRUE
> 2    | 1    | 1    | TRUE
> 2    | 1    | 2    | TRUE
> 2    | 1    | 3    | TRUE

Which groups by year and quar and then sets a column i which states whether that group is valid. All rows in the group get the calculated value of i.

Which works just fine. However, it has the side-effect of adding a real i column to the table.

I tried using a temporary column declared with .(i =...), but then the i column has the length of the shorter, grouped table and we get

dt[dt[, .(i = nrow(.SD) == 3), .(year, quar)]$i]
> Error in `[.data.table`(dt, dt[, .(i = nrow(.SD) == 3), .(year, quar)]$i) :
> i evaluates to a logical vector length 5 but there are 12 rows. [...]

So, is there a more elegant way of solving this? Or should I just use this and then drop i?

@akrun, it's identical to the output I've managed so far, minus the `i` column. — Wasabi, Jun 27 '19 at 15:26
@akrun, no, the entire table minus `i`. So that's Year, Quarter and Item. — Wasabi, Jun 27 '19 at 15:30

akrun · Answer 1 · 2019-06-27T15:37:58.977

3

If we need to subset, use the .I to get the row indices and subset

dt[dt[, .I[.N == 3], .(year, quar)]$V1]
#    year quar item
#1:    1    1    1
#2:    1    1    2
#3:    1    1    3
#4:    1    4    1
#5:    1    4    2
#6:    1    4    3
#7:    2    1    1
#8:    2    1    2
#9:    2    1    3

Or with .SD, but could be slow

dt[, .SD[.N == 3], .(year, quar)]

Or another option is if/else

dt[, if(.N == 3) .SD, .(year, quar)]

edited Jun 27 '19 at 15:37

answered Jun 27 '19 at 15:32

akrun

874,273
37
540
662

1

Yep, that does it. For some reason I misremembered `.I` as giving the rows in each grouped "sub-table". – Wasabi Jun 27 '19 at 15:44
```dt[dt[, .I[.N == 3], by = c("year", "quar")][,V1]]``` This will be a bit more versatile. this part I mean: `[,V1]` – M-- Jun 27 '19 at 15:49
1

@M-M You can use either `$` or `[` as the default column name wiould be `V1` – akrun Jun 27 '19 at 15:51
sure thing. I meant if you wanted to pass `V1` as an argument `$V1` wouldn't work. As I said, this is more versatile, not that the other one is wrong. – M-- Jun 27 '19 at 15:53

chinsoon12 · Answer 2 · 2019-06-28T05:47:22.420

Another option using join:

dt[dt[, .N, .(year, quar)][N==3], on=.(year, quar)]

edit: to address akrun comment, speed really depends on the characteristics of the dataset. Here are the timings for a sample dataset:

set.seed(0L)
ngrp <- 1e6
x <- sample(1:3, ngrp, TRUE)
dt <- data.table(year=rep(1:ngrp, times=x))[,
    quar:=year]

microbenchmark::microbenchmark(
    mtd0=dt[dt[, .I[.N == 3], .(year, quar)]$V1],
    mtd1=dt[dt[, .N, .(year, quar)][N==3], on=.(year, quar)],
    times=3L
)

timings:

Unit: milliseconds
 expr      min       lq     mean   median       uq      max neval cld
 mtd0 851.4689 878.5954 891.4074 905.7219 911.3766 917.0314     3   b
 mtd1 268.4806 295.0696 317.0266 321.6586 341.2996 360.9407     3  a

Is it faster when compared to `.I` – akrun Jun 28 '19 at 02:44 — akrun, Jun 28 '19 at 02:44

score 1 · Answer 3 · answered Jun 27 '19 at 16:17

Of course this is a data.table question, and @akrun's answer covers it but for the sake of completion;

A dplyr solution:

library(dplyr)
dt %>% group_by(year,quar) %>% filter(n()==3)

## A tibble: 9 x 3
## Groups:   year, quar [3]
#   year  quar  item
#  <dbl> <dbl> <dbl>
#1     1     1     1
#2     1     1     2
#3     1     1     3
#4     1     4     1
#5     1     4     2
#6     1     4     3
#7     2     1     1
#8     2     1     2
#9     2     1     3

In Base:

dt[ave(dt$item, dt[,c("year","quar")], FUN = length) == 3, ]

or

freqt <- as.data.frame(table(dt[,c("year", "quar")]))
subt <- freqt[freqt$Freq == 3,c("year", "quar")]

merge(x = subt, y = dt, by = c("year", "quar") , all.x = TRUE)

Only get data.table groups with a given number of rows

3 Answers3