
I have a data table organized by id and year, with a frequency (freq) value for every year where the frequency is at least 1. The start and end year may differ for every id.

Example:

> dt <- data.table(id=c('A','A','A','A','B','B','B','B'),year=c(2010,2012,2013,2015,2006,2007,2010,2011),freq=c(2,1,4,3,1,3,5,7))
> dt
   id year freq
1:  A 2010    2
2:  A 2012    1
3:  A 2013    4
4:  A 2015    3
5:  B 2006    1
6:  B 2007    3
7:  B 2010    5
8:  B 2011    7

I would like to make each time series by id complete, i.e. add rows with freq=0 for any missing year. So the result for the example above should look like this:

 id year freq
  A 2010    2
  A 2011    0
  A 2012    1
  A 2013    4
  A 2014    0
  A 2015    3
  B 2006    1
  B 2007    3
  B 2008    0
  B 2009    0
  B 2010    5
  B 2011    7

I'm starting with data.table and I'm interested to see if this is doable. With plyr or dplyr I would have merged each sub dataframe (by id) with a complete column of years. Is there an equivalent to this solution with data.table?
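For reference, a sketch of the tidyr/dplyr approach I have in mind (using `complete()` with `full_seq()`, which expand each group's years to a full sequence; untested against the real data):

```r
library(data.table)
library(dplyr)
library(tidyr)

dt <- data.table(id   = c('A','A','A','A','B','B','B','B'),
                 year = c(2010,2012,2013,2015,2006,2007,2010,2011),
                 freq = c(2,1,4,3,1,3,5,7))

# Complete each id's years to a full min-to-max sequence,
# filling freq with 0 in the newly created rows
dt %>%
  group_by(id) %>%
  complete(year = full_seq(year, period = 1), fill = list(freq = 0)) %>%
  ungroup()
```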

Erwan
  • This should get you going: [data.table equivalent of tidyr::complete()](https://stackoverflow.com/questions/43483497/data-table-equivalent-of-tidyrcomplete), [data.table equivalent of tidyr::complete with group_by](https://stackoverflow.com/questions/47095292/data-table-equivalent-of-tidyrcomplete-with-group-by), and the questions linked therein. – Henrik Nov 02 '21 at 18:20
  • @Henrik, I think `CJ` and `unique` won't work here since it's by-`id`, where `A` should not get year `2006`. But a `seq`-based solution should work: https://stackoverflow.com/q/69748340/3358272 is a previous answer for that (though it uses `seq.Date` instead of just `seq`, the premise is the same). – r2evans Nov 02 '21 at 18:23
  • (That comment was posted before I saw your second link added, @Henrik) – r2evans Nov 02 '21 at 18:29

3 Answers


We can't use a CJ-based approach here, because the missing rows must be filled in per id (A should not get a 2006 row). An alternative is to join dt to a per-id sequence of years:

library(data.table)
dt[ dt[, .(year = do.call(seq, as.list(range(year)))), by = .(id)],
    on = .(id, year)
  ][is.na(freq), freq := 0][]
#         id  year  freq
#     <char> <int> <num>
#  1:      A  2010     2
#  2:      A  2011     0
#  3:      A  2012     1
#  4:      A  2013     4
#  5:      A  2014     0
#  6:      A  2015     3
#  7:      B  2006     1
#  8:      B  2007     3
#  9:      B  2008     0
# 10:      B  2009     0
# 11:      B  2010     5
# 12:      B  2011     7
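For contrast, a quick illustration of why the CJ route fails: `CJ` crosses every id with the global year range, so each id gets every year in the data, not just its own span.

```r
library(data.table)

dt <- data.table(id   = c('A','A','A','A','B','B','B','B'),
                 year = c(2010,2012,2013,2015,2006,2007,2010,2011),
                 freq = c(2,1,4,3,1,3,5,7))

# CJ builds the full cross-product of ids and years, so A wrongly
# picks up rows for 2006-2009 (with freq NA) outside its own range
dt[CJ(id = unique(id), year = min(year):max(year)), on = .(id, year)]
```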
r2evans
    Thanks! I tested this solution with my real data (10m rows) and it's the fastest of the 3 answers, `system.time` gives `7.713 0.365 7.306`. – Erwan Nov 02 '21 at 23:09

Another solution, maybe more explicit than @r2evans'? First make a table of the complete year series for each id:

years <- dt[, list(year= seq(min(year), max(year))), by= id]
years
    id year
 1:  A 2010
 2:  A 2011
 3:  A 2012
 4:  A 2013
 5:  A 2014
 6:  A 2015
 7:  B 2006
 8:  B 2007
 9:  B 2008
10:  B 2009
11:  B 2010
12:  B 2011

then merge and replace NAs:

full <- merge(dt, years, all.y= TRUE)
full[, freq := ifelse(is.na(freq), 0, freq)]

full
    id year freq
 1:  A 2010    2
 2:  A 2011    0
 3:  A 2012    1
 4:  A 2013    4
 5:  A 2014    0
 6:  A 2015    3
 7:  B 2006    1
 8:  B 2007    3
 9:  B 2008    0
10:  B 2009    0
11:  B 2010    5
12:  B 2011    7
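As an aside (assuming data.table >= 1.12.4, which added this helper), the `ifelse()` line can be replaced with `setnafill()`, which fills the NAs by reference instead of copying the column:

```r
# Equivalent to the ifelse() step above, but modifies 'full' in place;
# setnafill() works on numeric columns, which freq is
setnafill(full, type = "const", fill = 0, cols = "freq")
```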
dariober
  • Thanks! I tested this solution with my real data (10m rows) and it's almost as fast as the fastest of the 3 answers, `system.time` gives `8.549 0.932 7.860`. – Erwan Nov 02 '21 at 23:13

Here is another data.table way to solve your problem: for each id, join its `.SD` to the full min-to-max sequence of years, then replace the resulting NAs with 0:

dt[, .SD[.(min(year):max(year)), on="year"], by=id][is.na(freq), freq:=0]

#         id  year  freq
#     <char> <int> <num>
#  1:      A  2010     2
#  2:      A  2011     0
#  3:      A  2012     1
#  4:      A  2013     4
#  5:      A  2014     0
#  6:      A  2015     3
#  7:      B  2006     1
#  8:      B  2007     3
#  9:      B  2008     0
# 10:      B  2009     0
# 11:      B  2010     5
# 12:      B  2011     7
  • Thanks! I like how concise your solution is, but it turned out to be much slower than the other two answers: with my real data (10m rows) `system.time` gives `283.120 0.814 283.784 `. – Erwan Nov 02 '21 at 23:11