
I have a data table organized by id and year, with a frequency (freq) value for every year where the frequency is at least 1. The start and end year may differ for every id.

Example:

> dt <- data.table(id=c('A','A','A','A','B','B','B','B'),year=c(2010,2012,2013,2015,2006,2007,2010,2011),freq=c(2,1,4,3,1,3,5,7))
> dt
   id year freq
1:  A 2010    2
2:  A 2012    1
3:  A 2013    4
4:  A 2015    3
5:  B 2006    1
6:  B 2007    3
7:  B 2010    5
8:  B 2011    7

I would like to make each time series by id complete, i.e. add rows with freq=0 for any missing year. So the result for the example above should look like this:

 id year freq
  A 2010    2
  A 2011    0
  A 2012    1
  A 2013    4
  A 2014    0
  A 2015    3
  B 2006    1
  B 2007    3
  B 2008    0
  B 2009    0
  B 2010    5
  B 2011    7

I'm starting with data.table and I'm interested to see if this is doable. With plyr or dplyr I would have merged each sub dataframe (by id) with a complete column of years. Is there an equivalent to this solution with data.table?
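For reference, a sketch of the tidyr/dplyr approach I have in mind (using `complete()` with `full_seq()`, which expand each group's years to a full sequence; untested against the real data):

```r
library(data.table)
library(dplyr)
library(tidyr)

dt <- data.table(id   = c('A','A','A','A','B','B','B','B'),
                 year = c(2010,2012,2013,2015,2006,2007,2010,2011),
                 freq = c(2,1,4,3,1,3,5,7))

# Complete each id's years to a full min-to-max sequence,
# filling freq with 0 in the newly created rows
dt %>%
  group_by(id) %>%
  complete(year = full_seq(year, period = 1), fill = list(freq = 0)) %>%
  ungroup()
```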

Erwan
  • This should get you going: [data.table equivalent of tidyr::complete()](https://stackoverflow.com/questions/43483497/data-table-equivalent-of-tidyrcomplete), [data.table equivalent of tidyr::complete with group_by](https://stackoverflow.com/questions/47095292/data-table-equivalent-of-tidyrcomplete-with-group-by), and the questions linked therein. – Henrik Nov 02 '21 at 18:20
  • @Henrik, I think `CJ` and `unique` won't work here since it's by-`id`, where `A` should not get year `2006`. But a `seq`-based solution should work: https://stackoverflow.com/q/69748340/3358272 is a previous answer for that (though it uses `seq.Date` instead of just `seq`, the premise is the same). – r2evans Nov 02 '21 at 18:23
  • (That comment was posted before I saw your second link added, @Henrik) – r2evans Nov 02 '21 at 18:29

3 Answers


We can't use a CJ-based approach here, because the missing rows must be filled in per id (A should not get a 2006 row). An alternative is to join dt to a per-id sequence of years:

library(data.table)
dt[ dt[, .(year = do.call(seq, as.list(range(year)))), by = .(id)],
    on = .(id, year)
  ][is.na(freq), freq := 0][]
#         id  year  freq
#     <char> <int> <num>
#  1:      A  2010     2
#  2:      A  2011     0
#  3:      A  2012     1
#  4:      A  2013     4
#  5:      A  2014     0
#  6:      A  2015     3
#  7:      B  2006     1
#  8:      B  2007     3
#  9:      B  2008     0
# 10:      B  2009     0
# 11:      B  2010     5
# 12:      B  2011     7
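For contrast, a quick illustration of why the CJ route fails: `CJ` crosses every id with the global year range, so each id gets every year in the data, not just its own span.

```r
library(data.table)

dt <- data.table(id   = c('A','A','A','A','B','B','B','B'),
                 year = c(2010,2012,2013,2015,2006,2007,2010,2011),
                 freq = c(2,1,4,3,1,3,5,7))

# CJ builds the full cross-product of ids and years, so A wrongly
# picks up rows for 2006-2009 (with freq NA) outside its own range
dt[CJ(id = unique(id), year = min(year):max(year)), on = .(id, year)]
```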
r2evans
    Thanks! I tested this solution with my real data (10m rows) and it's the fastest of the 3 answers, `system.time` gives `7.713 0.365 7.306`. – Erwan Nov 02 '21 at 23:09

Another solution, maybe more explicit than @r2evans'? First make a table of the complete year series for each id:

years <- dt[, list(year= seq(min(year), max(year))), by= id]
years
    id year
 1:  A 2010
 2:  A 2011
 3:  A 2012
 4:  A 2013
 5:  A 2014
 6:  A 2015
 7:  B 2006
 8:  B 2007
 9:  B 2008
10:  B 2009
11:  B 2010
12:  B 2011

then merge and replace NAs:

full <- merge(dt, years, all.y= TRUE)
full[, freq := ifelse(is.na(freq), 0, freq)]

full
    id year freq
 1:  A 2010    2
 2:  A 2011    0
 3:  A 2012    1
 4:  A 2013    4
 5:  A 2014    0
 6:  A 2015    3
 7:  B 2006    1
 8:  B 2007    3
 9:  B 2008    0
10:  B 2009    0
11:  B 2010    5
12:  B 2011    7
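As an aside (assuming data.table >= 1.12.4, which added this helper), the `ifelse()` line can be replaced with `setnafill()`, which fills the NAs by reference instead of copying the column:

```r
# Equivalent to the ifelse() step above, but modifies 'full' in place;
# setnafill() works on numeric columns, which freq is
setnafill(full, type = "const", fill = 0, cols = "freq")
```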
dariober
  • Thanks! I tested this solution with my real data (10m rows) and it's almost as fast as the fastest of the 3 answers, `system.time` gives `8.549 0.932 7.860`. – Erwan Nov 02 '21 at 23:13

Here is another data.table way to solve your problem: for each id, join its `.SD` to the full min-to-max sequence of years, then replace the resulting NAs with 0:

dt[, .SD[.(min(year):max(year)), on="year"], by=id][is.na(freq), freq:=0]

#         id  year  freq
#     <char> <int> <num>
#  1:      A  2010     2
#  2:      A  2011     0
#  3:      A  2012     1
#  4:      A  2013     4
#  5:      A  2014     0
#  6:      A  2015     3
#  7:      B  2006     1
#  8:      B  2007     3
#  9:      B  2008     0
# 10:      B  2009     0
# 11:      B  2010     5
# 12:      B  2011     7
  • Thanks! I like how concise your solution is, but it turned out to be much slower than the other two answers: with my real data (10m rows) `system.time` gives `283.120 0.814 283.784 `. – Erwan Nov 02 '21 at 23:11