Given a data.table, I would like to accumulate the unique elements of a column until the set reaches three unique values, then reset and resume:

library(data.table)
y <- data.table(a=c(1, 2, 2, 3, 3, 4, 3, 2, 2, 5, 6, 7, 9, 8))

The desired output unique_acc_roll_3 is:

a   unique_acc_roll_3
1                   1
2                 1 2
2                 1 2
3               1 2 3
3               1 2 3
4                   4  # 4 is the fourth unique value, so the set resets and starts again
3                 3 4
2               2 3 4
2               2 3 4
5                   5  # 5 is the fourth unique value, so the set resets and starts again
6                 5 6
7               5 6 7
9                   9  # 9 is the fourth unique value, so the set resets and starts again
8                 8 9

Because each row refers back recursively to the previous result, I got really stuck. The real data is large, so data.table solutions would be great.

thelatemail
Fabio Correa

2 Answers

I can't think of a way to avoid what is essentially a for loop, except to hide it behind a Reduce call. My logic is to keep union-ing each new value into the running set at each row until the set grows to length == n (4 here, one more than the three values allowed), at which point the new value becomes the starting point for the next iteration of the loop.

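# x: the set accumulated so far; y: the next value; n: set size that triggers a reset
unionlim <- function(x, y, n=4) {
  u <- union(x,y)
  # Reaching n unique values means y is one too many, so start over from y
  if(length(u) == n) y else u
}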

y[, out := sapply(Reduce(unionlim, a, accumulate=TRUE), paste, collapse=" ")]

#    a   out
# 1: 1     1
# 2: 2   1 2
# 3: 2   1 2
# 4: 3 1 2 3
# 5: 3 1 2 3
# 6: 4     4
# 7: 3   4 3
# 8: 2 4 3 2
# 9: 2 4 3 2
#10: 5     5
#11: 6   5 6
#12: 7 5 6 7
#13: 9     9
#14: 8   9 8
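
The union-and-reset logic described above can also be written as an explicit loop, which may be easier to follow than the Reduce call. This is only a sketch in base R: the names roll_unique and max_unique are hypothetical, not part of the answer, and sorting each set before pasting is an added assumption made so the strings match the order shown in the question's desired output.

```r
# Hypothetical helper: roll_unique() and max_unique are illustrative names only.
roll_unique <- function(a, max_unique = 3) {
  out <- character(length(a))
  current <- a[0]  # empty vector of the same type as a
  for (i in seq_along(a)) {
    candidate <- union(current, a[i])
    # A value that would push the set past max_unique starts a new set.
    current <- if (length(candidate) > max_unique) a[i] else candidate
    # Assumption: sort for display, to match the question's desired output.
    out[i] <- paste(sort(current), collapse = " ")
  }
  out
}

a <- c(1, 2, 2, 3, 3, 4, 3, 2, 2, 5, 6, 7, 9, 8)
result <- roll_unique(a)
print(result)
```

With the question's table this could be attached as a column with y[, unique_acc_roll_3 := roll_unique(a)]. It still visits every row one at a time, so it is not expected to be faster than the Reduce version.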

This is far from the fastest code on the planet, but a quick test suggests it will chew through about 1M rows in ~15 seconds on my decent machine.

bigy <- y[rep(1:nrow(y), 75e3)]
system.time({
  bigy[, out := sapply(Reduce(unionlim, a, accumulate=TRUE), paste, collapse=" ")]
})
#   user  system elapsed 
#  14.27    0.09   15.06 
thelatemail
  • thelatemail, excellent. I will dig down to understand your magic. A doubt: unionlim takes two variables x and y, how do you pass y inside the sapply? I can only see passing a to x... Thank you. – Fabio Correa May 06 '21 at 23:26
  • @FabioCorrea - `Reduce` is a funny one - it takes a function with 2 arguments x/y, but only one input, because at each step it passes the previous output of the loop as x and the current value as y - e.g. to sum 1 through 10 recursively you do `Reduce(function(x,y) x + y, 1:10, accumulate=TRUE)`, which gives `c(1, 1 + 2 = 3, 3 + 3 = 6, 6 + 4 = 10, ...)` etc. – thelatemail May 06 '21 at 23:42
  • Got it, sure! Thank you. – Fabio Correa May 06 '21 at 23:46
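
The running-sum example from the comment above can be run directly to see how Reduce feeds its own previous result back in as x. A minimal sketch in base R; the variable name running is just illustrative.

```r
# x is the result of the previous step, y is the next element of 1:10.
running <- Reduce(function(x, y) x + y, 1:10, accumulate = TRUE)
print(running)

# With accumulate = TRUE every intermediate result is kept,
# so this particular call reproduces cumsum().
print(identical(running, cumsum(1:10)))
```

In the answer, unionlim plays the role of the addition: x is the set built so far and y is the next value of a.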

purrr::accumulate also does the job here:

library(purrr)
y$b <- accumulate(y$a, ~if(length(union(.x, .y)) == 4) .y else union(.x, .y))

 y
   a       b
1  1       1
2  2    1, 2
3  2    1, 2
4  3 1, 2, 3
5  3 1, 2, 3
6  4       4
7  3    4, 3
8  2 4, 3, 2
9  2 4, 3, 2
10 5       5
11 6    5, 6
12 7 5, 6, 7
13 9       9
14 8    9, 8
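
The column b above is a list column. If a single string per row is wanted instead, as in the question's desired output, one possible follow-up is to collapse each set with map_chr. This is a sketch under assumptions: it requires the purrr package, the names sets and b are illustrative, and sorting each set is an addition made to match the order shown in the question.

```r
library(purrr)

a <- c(1, 2, 2, 3, 3, 4, 3, 2, 2, 5, 6, 7, 9, 8)

# Same accumulate step as above, written with a named function for readability.
sets <- accumulate(a, function(acc, new) {
  u <- union(acc, new)
  if (length(u) == 4) new else u
})

# Collapse each set into one space-separated string (sorted: an assumption).
b <- map_chr(sets, ~ paste(sort(.x), collapse = " "))
print(b)
```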
AnilGoyal