I'd like to fill in missing years per ID. This is easy for the small example below.
# Create example data table.
dt <- data.table(id = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3),
value = rnorm(10),
time = c(1, 2, 3, 3, 5, 6, 7, 2, 3, 6))
# Sort by time variable.
setkey(dt, time)
# Fill in the gaps.
system.time(
dt <- dt[, .SD[J(min(time):max(time))], by=id]
)
# Sort by ID and time, then print.
setkey(dt, id, time)[]
Gives
> dt
id value time
1: 1 -0.9062227 1
2: 1 2.0822289 2
3: 1 0.5073055 3
4: 2 0.3673813 3
5: 2 NA 4
6: 2 0.3726807 5
7: 2 -0.7381199 6
8: 2 0.7048979 7
9: 3 -0.7852230 2
10: 3 0.2327946 3
11: 3 NA 4
12: 3 NA 5
13: 3 -0.3430340 6
The years are now continuous and NAs have been added for missing values, exactly what I wanted.
However, this solution takes forever on a much larger data.table.
# Create big example data table
n <- 1e5
dt <- data.table(id = rep(1:(n/4), each=4),
value = rnorm(n),
year = sample(1997:2001, n, replace=TRUE))
# Remove duplicate years.
setkey(dt, id, year)
dt <- unique(dt)
# Fill in the gaps.
setkey(dt, year)
system.time(
dt2 <- dt[, .SD[J(min(year):max(year))], by=id]
)
Takes roughly 20 seconds for those ~100000 rows.
I would like to do this for a data table with 100 million rows. There must be a speedier way?