1

I currently have a large data table and I would like to accumulate a vector column (the classes column) for each group (id) along the years to get all past classes up to the current year in vector format.

EDIT: Previous topics (ie Cumulatively paste (concatenate) values grouped by another variable) have answerd this question in the case of characters concatenation which I don't want (because analyzing strings forces me to parse it before, which is cumputer intensive on large datasets). I would like to accumulate the vectors and get a column of vectors as well. I think the solution is pretty close but I just can't manage to find the right syntax for it.

 

Sample data:

id     year   classes
----------------------------
1      2000   c("A", "B")
1      2001   c("C", "A")
1      2002   "D"
1      2003   "E"
2      2001   "A"
2      2002   c("A", "D")
2      2003   "E"
...

Expected output :

id     year   classes       cumclasses
-----------------------------------------------------------
1      2000   c("A", "B")   c("A", "B")
1      2001   c("C", "A")   c("A", "B", "C", "A")
1      2002   "D"           c("A", "B", "C", "A", "D")
1      2003   "E"           c("A", "B", "C", "A", "D", "E")
2      2001   "A"           "A"
2      2002   c("A", "D")   c("A", "A", "D")
2      2003   "E"           c("A", "A", "D", "E")
...

My goal is to find an efficient solution because my dataset is fairly large.

For now I have a working (but ultra slow) solution using dplyr and purrr :

dt2 <- dt %>%
   setkeyv(c("id", "year")) %>%
   group_by(id) %>%
   mutate(cumclasses = accumulate(classes, append))

I'm looking for a data.table solution of the type:

#not working example
dt2 <- dt[, cumclasses := accumulate(classes, append), by = id]

or even a base R solution, the faster the better !

Thank you!

If you want to reproduce sample data please copy the following code:

dt <- data.table(id = 
                  c(1,1,1,1,2,2,2),
                year = 
                  c(2000,2001,2002,2003,2001,2002,2003),
                classes = 
                  list(c('A', 'B'), c('C', 'A'), 'D', 'E', 'A', c('A', 'D'), 'E'), key = 'id')

EDIT [SOLVED]:

A working solution is (using data.table and purrr):

dt[, cumClasses := list(accumulate(classes, append)), by = id]
Samsa
  • 148
  • 8

1 Answers1

1

One option would be to group by 'id', loop over the sequence of rows and extract the 'Classes' and paste it to together after unlisting the list column

dt[,  cumClasses := sapply(seq_len(.N), function(i) toString(unlist(classes[seq_len(i)]))), id][, 
         cumClasses := as.list(cumClasses)][]
akrun
  • 874,273
  • 37
  • 540
  • 662
  • Thank you for your answer. But this gives me a column of characters where values are comma separated and not a column of vectors. – Samsa Sep 06 '19 at 09:18
  • @Samsa okay, i thought you want to join them as string. If you want to have it as a list of vectorst, then remove the `toString` and also change the `sapply` to `lapply` – akrun Sep 06 '19 at 13:54
  • writing `dt[, cumClasses := lapply(seq_len(.N), function(i) unlist(classes[seq_len(i)])), by = id]` leads to the following error: `Supplied 2 items to be assigned to group 1 of size 4 in column 'cumClasses'. The RHS [...]` – Samsa Sep 09 '19 at 09:32