My question is essentially a generalisation of this SO post but with a rolling component.
I have a dataset of people, jobs and the dates on which they held said jobs (whilst the specifics aren't important, they make the prose easier). Each person can hold multiple jobs on a given date, or they can have no job, which shows up as a missing person-date-job row.
I want to create a summarised table where there is only one row for each person-date combination, thus necessitating creating a list-column that holds the IDs for jobs held by that person-date. I have managed to do this for contemporaneous person-job-dates following the above linked SO post. The complicating factor is that I want to look backwards by 3 periods, i.e. I need the job_id
list-col for date t to contain all the jobs held by a person in the date t, t-1 and t-2.
Below is some code to produce a toy input table and the desired output.
library(data.table)
# Input data
data <- data.table(
ind_id = c(rep(1, 3), rep(2, 4), rep(3, 2), rep(4, 5)),
date = c(1, 2, 3, 1, 2, 2, 3, 1, 3, 1, 1, 2, 2, 3),
job_id = c("A", "A", "A", "B", "B", "C", "B", "D", "E", "F", "G", "F", "G", "G")
)
# Desired output
output <- data.table(
ind_id = c(rep(1, 3), rep(2, 3), rep(3, 3), rep(4, 3)),
date = rep(1:3, 4),
job_id = list("A", "A", "A", "B", c("B", "C"), c("B", "C"), "D", c("D"), c("D", "E"), c("F", "G"), c("F", "G"), c("F", "G"))
)
And here is the code that works to make a table of contemporaneous person-job-date rows.
data_contemp <- data[, .(job_id = list(job_id)), by = .(date, ind_id)]
Something that I tried was to use frollapply
but it doesn't work if the output is not numeric unfortunately: data[, all_jobs := frollapply(job_id, 3, list), by = ind_id]
Appreciate everyone's help on this!
EDIT: I should add that a data.table
solution is highly preferred because the actual dataset is 607 million rows, data.table
is faster and more memory efficient, and the syntax is better.
EDIT 2: Added some code to generate an arbitrarily large input table.
n <- 600e6
n <- round(n / 15)
t1 <- data.table(ind_id = rep(1, 3), date = 1:3, job_id = rep("A", 3))
t2 <- data.table(ind_id = rep(2, 3), date = 1:3, job_id = c("A", "B", "B"))
t3 <- data.table(ind_id = rep(3, 5), date = c(1, 2, 2, 3, 3), job_id = c("A", "A", "B", "A", "B"))
t4 <- data.table(ind_id = rep(4, 2), date = c(1, 3), job_id = c("A", "B"))
t5 <- data.table(ind_id = rep(5, 4), date = c(1, 1, 2, 3), job_id = c("A", "B", "A", "A"))
data <- rbind(t1, t2, t3, t4, t5)
data <- data[rep(seq_len(nrow(data)), n)]
data[, ind_id := rleid(ind_id)]