
In a data.table, I want to add a column C3 that flags N randomly selected rows within each group (C1). Several similar questions have already been asked on SO (here, here and here), but based on their answers I still cannot work out a solution for my task.

library(data.table)
set.seed(1)
dt = data.table(C1 = c("A","A","A","B","C","C","C","D","D","D"), 
                 C2 = c(2,1,3,1,2,3,4,5,4,5)) 

dt
    C1 C2
 1:  A  2
 2:  A  1
 3:  A  3
 4:  B  1
 5:  C  2
 6:  C  3
 7:  C  4
 8:  D  5
 9:  D  4
10:  D  5

Here are the row indices of two randomly selected rows per group C1 (this doesn't work well for group B):

dt[, sample(.I, min(.N, 2)), by = C1]$V1
[1]  1  3  3  7  5 10  9

NB: for B only one row should be selected because group B consists of one row only.
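The stray index for group B is most likely down to documented behaviour of base R's `sample()`: when its first argument is a single number `x >= 1`, it samples from `1:x` rather than from the value itself. For B, `.I` is just `4`, so `sample(.I, 1)` draws from `1:4`. A minimal sketch of one way around this, which samples positions inside the group and then maps them back through `.I` (the name `idx` is only illustrative):

```r
library(data.table)

set.seed(1)
dt <- data.table(C1 = c("A","A","A","B","C","C","C","D","D","D"),
                 C2 = c(2,1,3,1,2,3,4,5,4,5))

# sample(.N, k) draws k positions from 1..N within the group;
# .I[...] then turns those positions into global row numbers.
# For a one-row group this is .I[sample(1, 1)], i.e. always that row.
idx <- dt[, .I[sample(.N, min(.N, 2))], by = C1]$V1

# Groups come back in order of first appearance, so idx holds
# two rows from A, one from B, two from C and two from D.
dt$C1[idx]
```

With this form, group B can only ever contribute row 4.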

Here is a solution for one randomly selected row in each group, which often doesn't work for group B:

dt[, C3 := .I == sample(.I, 1), by = C1]
dt
    C1 C2    C3
 1:  A  2 FALSE
 2:  A  1  TRUE
 3:  A  3 FALSE
 4:  B  1 FALSE
 5:  C  2  TRUE
 6:  C  3 FALSE
 7:  C  4 FALSE
 8:  D  5  TRUE
 9:  D  4 FALSE
10:  D  5 FALSE

What I actually want is to extend this to N rows. I've tried (for two rows):

dt[, C3 := .I==sample(.I, min(.N, 2)), by = C1]

which of course doesn't work.
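One way to see why this fails: `==` compares element by element and recycles the shorter vector, so it asks whether the i-th row number equals the i-th sampled value, not whether the row is among the sampled values. A small base-R illustration with made-up vectors (`rows` and `sampled` are hypothetical stand-ins for `.I` and the result of `sample()`):

```r
rows    <- c(5L, 6L, 7L)   # hypothetical .I for a three-row group
sampled <- c(7L, 5L)       # hypothetical result of sample(.I, 2)

# Elementwise comparison: sampled is recycled to c(7, 5, 7), and R warns
# because 3 is not a multiple of 2 (suppressed here for the demo).
eq_flags <- suppressWarnings(rows == sampled)
eq_flags   # FALSE FALSE  TRUE

# Membership test: is each row number among the sampled values?
in_flags <- rows %in% sampled
in_flags   # TRUE FALSE  TRUE
```

Only the membership test flags both sampled rows.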

Any help is much appreciated!

Serhii

2 Answers

2
dt[, C3 := 1:.N %in% sample(.N, min(.N, 2)), by = C1]

Or use head, though I expect that to be slower:

dt[, C3 := 1:.N %in% head(sample(.N), 2) , by = C1]

If the number of rows to flag differs between groups, you can keep the per-group sizes in a vector and index it by .GRP:

flagsz <- c(2, 1, 2, 3)
dt[, C3 := 1:.N %in% sample(.N, min(.N, flagsz[.GRP])), by = C1]
IceCreamToucan
  • Thanks for the quick answer! Just to understand: we check all rows to see whether they were sampled, and those which were sampled receive TRUE, right? – Serhii May 11 '18 at 14:51
  • Yes. `1:.N` gives the in-group row-number, and if the row-number is in the sample of those row numbers `sample(.N, min(.N, 2))` then `C3` is `TRUE` – IceCreamToucan May 11 '18 at 14:54
  • I wonder what if the number of items to flag is different for each group C1? Suppose N = c(2, 1, 2, 3) – Serhii May 15 '18 at 11:02
  • Awesome! Thanks, Ryan! – Serhii May 15 '18 at 13:05
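Note that `flagsz[.GRP]` relies on the sizes being listed in the order in which the groups first appear in the table. As a sketch of a variant that does not depend on that order, the sizes could be stored in a named vector and looked up by group label through `.BY`, data.table's per-group list of `by` values. The names `flagsz_by_name` and `flag_counts` are hypothetical, chosen only for this example:

```r
library(data.table)

set.seed(1)
dt <- data.table(C1 = c("A","A","A","B","C","C","C","D","D","D"),
                 C2 = c(2,1,3,1,2,3,4,5,4,5))

# Hypothetical per-group sizes, keyed by group label instead of position
flagsz_by_name <- c(D = 3, A = 2, C = 2, B = 1)

# .BY$C1 is the label of the current group; min() caps the request
# at the group size so small groups never ask for too many rows.
dt[, C3 := seq_len(.N) %in% sample(.N, min(.N, flagsz_by_name[[.BY$C1]])),
   by = C1]

# Count flagged rows per group to check the result
flag_counts <- dt[, .(flagged = sum(C3)), by = C1]
flag_counts
```

Which rows get flagged is random, but the per-group counts should come out as 2, 1, 2 and 3 for A, B, C and D.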
1
N=2
dt[, C3 := {if (.N < N) rep(TRUE,.N) else 1:.N %in%  sample(.N,N) }, by=C1]
dt
# C1 C2    C3
# 1:  A  2  TRUE
# 2:  A  1 FALSE
# 3:  A  3  TRUE
# 4:  B  1  TRUE
# 5:  C  2 FALSE
# 6:  C  3  TRUE
# 7:  C  4  TRUE
# 8:  D  5  TRUE
# 9:  D  4  TRUE
# 10:  D  5 FALSE
dww
  • Thanks! Your answer works just fine. I have accepted Ryan's answer because it doesn't use if-else. – Serhii May 11 '18 at 14:55