How do you sample data within each group in a data.table? (fastest way possible)

Question

I am trying to sample my data within each group, as in "How do you sample random rows within each group in a data.table?". Data:

library(data.table)
set.seed(245)
DT = data.table(d = sample(1:2000), m = sample(1:700, 2000, replace = TRUE))

DT[,length(unique(m))]
[1] 669
DT[,length(unique(d))]
[1] 2000

1) Firstly, the approach DT[, .SD[sample(.N, 1)], by = m] is not fast enough, and I am quite certain that it could be done faster and better. However, the faster approach mentioned in the post linked above,

DTs <- DT[DT[, sample(.I, 1), by=m][[2]],]
DTs[, .N]
[1] 659    
DTs[, length(unique(d))]
[1] 633

does not work correctly, and I do not understand why (every element in DTs[, d] should be unique).

2) Secondly, when I tried a different approach (to extract only d values):

DT[, sample(d, 1L), by = m][[2]]

I noticed that the number of unique values differs on every run, and that it is also not the number I expected:

length(unique(DT[, sample(d, 1L), by = m][[2]]))
[1] 632
length(unique(DT[, sample(d, 1L), by = m][[2]]))
[1] 638

Could someone explain why this is happening, or what I am doing wrong? And how can I do this in the fastest way possible?

I think it's a problem with the design of the sample function. When `.I` is a vector, it behaves like you want, but when `.I` is a scalar (length-one vector), it spins off to do something entirely different. Try `DTs <- DT[DT[, .I[sample(.N, 1)], by=m][[2]],]; DTs[, .(.N, uniqueN(d))]` — Frank, Dec 08 '16 at 15:26
How about `unique(DT[sample(1:.N)], by = "m")`? You can add `[, list(d)]` or `[,d]` — talat, Dec 08 '16 at 15:31
Another option is `DT[DT[, max(sample(.I, 1), .I), m][,V1]]`. I benchmarked the suggestions so far, and @docendodiscimus is clear winner. — dww, Dec 08 '16 at 16:44
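
The behaviour Frank describes can be checked directly. Below is a minimal sketch, assuming the same DT as in the question; the names one_row_group, draws, idx and d_vals are only illustrative and do not come from the thread. It first shows the base R special case (for a single number x >= 1, sample(x, 1) draws from 1:x), which would explain why a one-row group can end up pointing at a row of another group, and then uses Frank's suggestion of sampling a position in 1:.N and indexing with it. The exact counts depend on the R version's random number generator, so no output is shown.

```r
library(data.table)

set.seed(245)
DT <- data.table(d = sample(1:2000), m = sample(1:700, 2000, replace = TRUE))

# Base R: for a single number x >= 1, sample(x, 1) draws from 1:x, not from x.
one_row_group <- 1500L                      # stands in for .I of a one-row group
draws <- replicate(100, sample(one_row_group, 1))
range(draws)                                # spread over 1:1500, not always 1500

# Sampling a position in 1:.N and then indexing sidesteps the special case.
idx <- DT[, .I[sample(.N, 1L)], by = m]$V1  # one row number per group
DTs <- DT[idx]
DTs[, .(rows = .N, unique_d = uniqueN(d), unique_m = uniqueN(m))]

# The same indexing trick when only the d values are needed (point 2).
d_vals <- DT[, d[sample(.N, 1L)], by = m]$V1
length(unique(d_vals)) == uniqueN(DT$m)     # one distinct d per group
```

With this form, rows, unique_d and unique_m should all equal the number of groups, because every group contributes exactly one of its own rows and d has no duplicates in DT.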
