
We stumbled upon some strange behaviour trying to expand a data.table. The following code works alright:

library(data.table)
dt <- data.table(var1=1:2e3, var2=1:2e3, freq=1:2e3)
system.time(dt.expanded <- dt[ ,list(freq=rep(1,freq)),by=c("var1","var2")])
##    user  system elapsed 
##    0.05    0.01    0.06

But using the following data.table

set.seed(1)
dt <- data.table(var1=sample(letters,1000,replace=TRUE),
                 var2=sample(LETTERS,1000,replace=TRUE),
                 freq=sample(1:10,1000,replace=TRUE))

with the same code gives

Error in rep(1, freq) : invalid 'times' argument

My question
Might this be a bug in data.table?

(I got the syntax of this example from R Machine Learning Essentials)

Edit
So the problem really seems to be with rep and not with data.table. The help page for rep says for the parameter times:

A integer vector giving the (non-negative) number of times to repeat each element if of length length(x), or to repeat the whole vector if of length 1.

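In the second data.table, some (var1, var2) groups contain more than one row, so freq (and hence times) has a length greater than 1 while x has length 1. That mismatch is what throws the error.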
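The rule quoted from the help page can be checked in base R alone. The following is a minimal sketch; the variable names are only illustrative, and the error is caught with tryCatch so the script keeps running.

```r
# times of length 1: the whole vector x is repeated
whole <- rep(1, 3)

# times of length length(x): each element gets its own repeat count
each <- rep(c("a", "b"), times = c(2, 3))

# times of length 2 with x of length 1: neither rule applies,
# so rep() signals an error, which is captured here as a string
msg <- tryCatch(rep(1, c(2, 3)), error = function(e) conditionMessage(e))

print(whole)
print(each)
print(msg)
```

The third call reproduces the situation inside a group that has more than one row.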

vonjd

1 Answer


My guess: when rep(x,times) is given a vector for times, it insists that x be the same length (instead of doing the natural thing in R and recycling). So manual recycling works:

dt[ ,.(rep(rep(1,.N),freq)), by=.(var1,var2)]

Seems to be a problem in base R (or maybe it's deliberate?), not in data.table. The OP didn't hit this problem in the first example because every (var1, var2) combination there is unique, so grouping with by=.(var1,var2) produced exactly one row per group and the times argument was a scalar.
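To make the difference concrete, here is a small sketch on a hypothetical toy table (`toy` is not from the question) in which one group has two rows. It assumes the data.table package is installed.

```r
library(data.table)

# Hypothetical toy table: the group (a, X) occurs twice,
# so freq has length 2 inside that group
toy <- data.table(var1 = c("a", "a", "b"),
                  var2 = c("X", "X", "Y"),
                  freq = c(2L, 3L, 1L))

# The question's call: x has length 1 but times has length 2 in group (a, X)
failed <- tryCatch(
  toy[, list(freq = rep(1, freq)), by = c("var1", "var2")],
  error = function(e) conditionMessage(e)
)

# Manual recycling: x is first stretched to .N elements,
# so length(x) equals length(times) in every group
expanded <- toy[, .(freq = rep(rep(1, .N), freq)), by = .(var1, var2)]

print(failed)
print(nrow(expanded))  # one output row per unit of freq
```

With manual recycling the expanded table has as many rows as the sum of freq, which is presumably the result the question was after.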

Frank
  • So what has the `by` clause to do with it? – vonjd Jul 07 '15 at 18:53
  • @vonjd The `by` clause ensured that `times` was a scalar, not a vector. I'll add that to the answer. – Frank Jul 07 '15 at 18:54
  • Technically there are no scalars in R, only vectors with length 1... still not understanding the problem fully :-( – vonjd Jul 07 '15 at 18:55
  • I'm using "scalar" as a shorthand for a vector of length 1, yes. – Frank Jul 07 '15 at 18:56
  • @vonjd Try `dt[,.N,by=.(var1,var2)][,table(N)]` to see the different group sizes in `dt`. If any of them are different from `1`, the error will be thrown. – Frank Jul 07 '15 at 18:56
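
The group-size check from the last comment can be run on both tables from the question. This is only a sketch; the names `dt1`, `dt2`, `sizes1` and `sizes2` are introduced here for illustration.

```r
library(data.table)

# First table from the question: every (var1, var2) pair is unique
dt1 <- data.table(var1 = 1:2e3, var2 = 1:2e3, freq = 1:2e3)
sizes1 <- dt1[, .N, by = .(var1, var2)]

# Second table from the question: 1000 rows spread over at most
# 26 * 26 = 676 pairs, so at least one pair must repeat
set.seed(1)
dt2 <- data.table(var1 = sample(letters, 1000, replace = TRUE),
                  var2 = sample(LETTERS, 1000, replace = TRUE),
                  freq = sample(1:10, 1000, replace = TRUE))
sizes2 <- dt2[, .N, by = .(var1, var2)]

print(all(sizes1$N == 1L))  # only single-row groups
print(any(sizes2$N > 1L))   # some groups have several rows
```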