
My question is essentially a generalisation of this SO post but with a rolling component.

I have a dataset of people, jobs and the dates on which they held said jobs (whilst the specifics aren't important, they make the prose easier). Each person can hold multiple jobs on a given date, or they can have no job, which shows up as a missing person-date-job row.

I want to create a summarised table with only one row for each person-date combination, which requires a list-column holding the IDs of the jobs held by that person on that date. I have managed to do this for contemporaneous person-job-date rows following the SO post linked above. The complicating factor is that I want to look backwards by 3 periods, i.e. the job_id list-column for date t needs to contain all the jobs held by a person at dates t, t-1 and t-2.

Below is some code to produce a toy input table and the desired output.

library(data.table)

# Input data
data <- data.table(
  ind_id = c(rep(1, 3), rep(2, 4), rep(3, 2), rep(4, 5)),
  date = c(1, 2, 3, 1, 2, 2, 3, 1, 3, 1, 1, 2, 2, 3),
  job_id = c("A", "A", "A", "B", "B", "C", "B", "D", "E", "F", "G", "F", "G", "G")
)

# Desired output
output <- data.table(
  ind_id = c(rep(1, 3), rep(2, 3), rep(3, 3), rep(4, 3)),
  date = rep(1:3, 4),
  job_id = list("A", "A", "A", "B", c("B", "C"), c("B", "C"), "D", c("D"), c("D", "E"), c("F", "G"), c("F", "G"), c("F", "G"))
)

And here is the code that works to make a table of contemporaneous person-job-date rows.

data_contemp <- data[, .(job_id = list(job_id)), by = .(date, ind_id)]

Something I tried was frollapply, but unfortunately it doesn't work when the output is not numeric: data[, all_jobs := frollapply(job_id, 3, list), by = ind_id]
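(For context, a quick sketch of the limitation using the toy data above: frollapply is happy to compute rolling numeric summaries, but each window here would have to return a character vector of job IDs, which it cannot. The column name rows_in_window below is purely illustrative.)

# A rolling *numeric* summary per person works, e.g. a rolling row count...
data[, .(date, rows_in_window = frollapply(rep(1, .N), 3, sum)), by = ind_id]
# ...but a rolling *list* of job IDs per window is not something frollapply can return.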

Appreciate everyone's help on this!

EDIT: I should add that a data.table solution is highly preferred because the actual dataset is 607 million rows, data.table is faster and more memory efficient, and the syntax is better.

EDIT 2: Added some code to generate an arbitrarily large input table.

n <- 600e6
n <- round(n / 15)
  
t1 <- data.table(ind_id = rep(1, 3), date = 1:3, job_id = rep("A", 3))
t2 <- data.table(ind_id = rep(2, 3), date = 1:3, job_id = c("A", "B", "B"))
t3 <- data.table(ind_id = rep(3, 5), date = c(1, 2, 2, 3, 3), job_id = c("A", "A", "B", "A", "B"))
t4 <- data.table(ind_id = rep(4, 2), date = c(1, 3), job_id = c("A", "B"))
t5 <- data.table(ind_id = rep(5, 4), date = c(1, 1, 2, 3), job_id = c("A", "B", "A", "A"))
  
data <- rbind(t1, t2, t3, t4, t5)
data <- data[rep(seq_len(nrow(data)), n)]
data[, ind_id := rleid(ind_id)]
Aaron
  • Could you include code to generate the 600M-row example data? I rewrote frollapply last week; it now supports other types as well, but it is not yet published. It won't be super fast, but at least it's parallel, so you can scale by throwing more CPUs at it. – jangorecki Oct 10 '22 at 00:31
  • Anyway, frollapply will not support list columns (though possibly it could), but you can call it before wrapping your column into a list and return a list from frollapply. You could also split your DT by person and use mclapply to parallelise over persons. – jangorecki Oct 10 '22 at 00:39
  • Added some code to generate an arbitrarily large input table. I'm not sure how calling frollapply before creating the list columns will work, because the state of the column that I want to work on depends on the data being presented in a particular way (that relies on list-cols, but if another structure can achieve the same end, of course I am not tied to this format). – Aaron Oct 10 '22 at 04:36
  • Shouldn't the 8th row be excluded from the `output`? – jangorecki Oct 10 '22 at 11:09
  • frollapply won't scale on this particular data; use Waldi's solution. – jangorecki Oct 10 '22 at 11:34
  • My intention was for the output to be a balanced panel even if the input is missing data, but that's something I can fix myself outside the scope of this question. – Aaron Oct 10 '22 at 21:43

2 Answers


You could use a self non-equi join:

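# Non-equi self join: compute the window start, then for every person-date
# match all rows whose date falls in [date - 2, date] and collapse the matched
# job_ids into one list-column row per person-date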
data[, start := date - 2]
data[data, .(ind_id, date = x.date, job_id = i.job_id),
     on = .(ind_id, start <= date, date >= date)][
  , .(job_id = list(unique(job_id))), .(ind_id, date)]

    ind_id  date job_id
     <num> <num> <list>
 1:      1     1      A
 2:      1     2      A
 3:      1     3      A
 4:      2     1      B
 5:      2     2    B,C
 6:      2     3    B,C
 7:      3     1      D
 8:      3     3    D,E
 9:      4     1    F,G
10:      4     2    F,G
11:      4     3    F,G

Slight difference compared to your expected output: date = 2 isn't present for ind_id = 3 because it isn't present in the initial data:

    ind_id  date job_id
     <num> <int> <list>
8:      3     2      D
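If a fully balanced panel is required (as the asker notes in the comments, ind_id = 3 should also get a date = 2 row carrying job D), one option is to run the same non-equi join against a complete person-date grid rather than against data itself. A minimal sketch; grid and res are names introduced here for illustration, and any person-date whose whole 3-period window contains no jobs would still need a final join back onto grid:

grid <- CJ(ind_id = unique(data$ind_id), date = unique(data$date))
grid[, start := date - 2]

res <- grid[data, .(ind_id, date = x.date, job_id = i.job_id),
            on = .(ind_id, start <= date, date >= date)][
  , .(job_id = list(unique(job_id))), .(ind_id, date)]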
Waldi

Thanks, Waldi, for your solution. I actually managed to figure out my own solution with a combination of helper columns and mapply, so I've included my method as an answer and benchmarked both methods. Waldi's solution with non-equi joins is about 20 per cent faster than my mapply method, but uses about 40 per cent more memory. This differential appears to remain constant as the number of rows scales.

Given memory is cheap and time is not, Waldi's solution works best here.

Thanks everyone for contributing!

library(data.table)
library(collapse)

## Input data

# Create five types of people with different employment histories:
# Type 1: same job over time
# Type 2: changes to a new job in t2
# Type 3: picks up a new job in t2
# Type 4: employed in t1, unemployed in t2, employed in t3
# Type 5: loses a second job in t2

make_data <- function(n) {
  n <- round(n / 15)
  
  t1 <- data.table(ind_id = rep(1, 3), date = 1:3, job_id = rep("A", 3))
  t2 <- data.table(ind_id = rep(2, 3), date = 1:3, job_id = c("A", "B", "B"))
  t3 <- data.table(ind_id = rep(3, 5), date = c(1, 2, 2, 3, 3), job_id = c("A", "A", "B", "A", "B"))
  t4 <- data.table(ind_id = rep(4, 2), date = c(1, 3), job_id = c("A", "B"))
  t5 <- data.table(ind_id = rep(5, 4), date = c(1, 1, 2, 3), job_id = c("A", "B", "A", "A"))
  
  data <- rbind(t1, t2, t3, t4, t5)
  data <- data[rep(seq_len(nrow(data)), n)]
  data[, ind_id := rleid(ind_id)]
  
  data <- data[, .(job_id = list(job_id)), by = .(date, ind_id)]
  
  # Add back missing person-date rows to create balanced panel
  date_person_rows <- CJ(ind_id = unique(data$ind_id), date = unique(data$date))
  data <- date_person_rows[data, job_id := i.job_id, on = .(date, ind_id)]
  
  return(data)
}

method_1 <- function(data) {
  
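  # Lag the list-column by 0, 1 and 2 periods within each person, then take the
  # row-wise union of the three windows (na_rm() comes from the collapse package)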
  data[, paste0("jobs_", 0:2) := shift(.(job_id), 0:2), by = ind_id]
  data[, job_id := mapply(jobs_0, jobs_1, jobs_2, SIMPLIFY = FALSE,
                          FUN = function(a, b, c) sort(na_rm(unique(c(a, b, c)))))]
  data[, c("jobs_0", "jobs_1", "jobs_2") := NULL]
  
  setkey(data, NULL) # drop the key: make_data() returns a table keyed by CJ(), whereas method_2 builds an unkeyed result
  
  return(data)
         
}

method_2 <- function(data) {
  
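  # Waldi's approach: non-equi self join over the window [date - 2, date], then
  # collapse the matched job_ids into one list-column row per person-date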
  data[, start := date - 2]
  data <- 
    data[data, .(ind_id, date = x.date, job_id = i.job_id), on = .(ind_id, start <= date, date >= date)][, .(job_id = list(unique(job_id))), .(ind_id, date)]
  data[, job_id := lapply(job_id, function(x) unique(unlist(x)))]
  
  return(data)
  
}

# Benchmark
bench::mark(
  method_1(make_data(10e4)),
  method_2(make_data(10e4)),
  iterations = 1L
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 × 6
#>   expression                      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                 <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 method_1(make_data(1e+05))     2.7s     2.7s     0.370    27.4MB     17.8
#> 2 method_2(make_data(1e+05))    2.08s    2.08s     0.481      43MB     14.9

Created on 2022-10-11 with reprex v2.0.2
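As a quick sanity check (not part of the original post), both methods should return the same set of jobs for every person-date once the rows are put in the same order; a sketch on a small input, with d1 and d2 as illustrative names:

d1 <- method_1(make_data(150))[order(ind_id, date)]
d2 <- method_2(make_data(150))[order(ind_id, date)]
all(d1$ind_id == d2$ind_id,
    d1$date == d2$date,
    mapply(setequal, d1$job_id, d2$job_id))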

Aaron