1

I want to generate 4 new columns from an existing variable total by random sampling. the results for each row should meet the condition s1 + s2 + s3 + s4 == total. Fro example,

> tabulate(sample.int(4, 100, replace = TRUE))
[1] 22 21 27 30

The following code does not work since the function appears to recycle the first row and applies it column-wise.

 DT <- data.table(total = c(100, 110, 90, 92))
 DT[, c(paste0("s", 1:4)) := tabulate(sample.int(4, total, replace = TRUE))]

> DT
   total s1 s2 s3 s4
1:   100 31 31 31 31
2:   110 25 25 25 25
3:    90 22 22 22 22
4:    92 22 22 22 22

How to get around this? I am clearly missing some basic understanding on how R vector/list work. Your help will be much appreciated.

Dong
  • 481
  • 4
  • 15

2 Answers2

1

Edited following edited question:

data.table will expect a list internally when you want to assign to many columns. To get it so each row is unique, then you can do that by adding a by each row:

DT <- data.table(total = c(100, 110, 90, 102, 92))
DT[, c(paste0("s", 1:4)) := {
  as.list(tabulate(sample.int(4, total, replace = TRUE)))
  }, by = seq(NROW(DT))]

Which outputs the following, satisfying the OP criteria:

> DT
   total s1 s2 s3 s4
1:   100 27 28 28 17
2:   110 25 23 36 26
3:    90 26 19 26 19
4:   102 28 24 21 29
5:    92 17 27 22 26
> apply(DT[, 2:5],1, sum)
[1] 100 110  90 102  92
  • Great answer. Thanks for the `by` tip. I wonder if the performance can be improved. I run a test with one million numbers `> DT <- data.table(total = round(rnorm(1000000, mean = 800, sd = 250)) ) > DT[total < 160, total := 240]` and it take 57 sec on my workstation computer. – Dong Sep 13 '20 at 18:51
  • I don't think you are going to get much faster. Just the `sample.int` line takes at least 40 seconds on my machine: `> system.time(vals <- sample.int(4, sum(DT$total), replace = TRUE)) ... 41.11 seconds` – eliot.mcintire Sep 14 '20 at 23:12
  • Yes. I was thinking alternative to `sample.int` --- it is linearly scaled with `total` – Dong Sep 15 '20 at 16:17
  • Apparently, there is: package `dqrng`. Replace the `sample.int` with `dqrng::dqsample.int` and it was over 3x faster for me (~20s), even on the large dataset. – eliot.mcintire Sep 16 '20 at 16:02
  • That settles it for now. Many thanks for the tips! – Dong Sep 16 '20 at 16:40
0

Maybe you can try the code below

DTout <- cbind(
  DT,
  do.call(
    rbind,
    lapply(DT$total, function(x) diff(sort(c(0, sample(x - 1, 3), x))))
  )
)

which gives

   total V1 V2 V3 V4
1:   100 51  5 17 27
2:   110 41  1 40 28
3:    90 32 34 14 10
4:   102  5 73 13 11
5:    92 17 13 17 45

Test

> rowSums(DTout[,-1])
[1] 100 110  90 102  92
ThomasIsCoding
  • 96,636
  • 9
  • 24
  • 81
  • @Dong See my test, I didn't see any problem – ThomasIsCoding Sep 14 '20 at 07:24
  • I meant your sampling to the four bins are not random. It is would be unlikely to hit a bin only once in 110 random tries as in your row 2. – Dong Sep 15 '20 at 16:14
  • @Dong What I sampled is the positions to set bins. Let's say we have 110 balls, and we have 109 possible positions to set 3 barriers, such that we randomly choose 3 positions out of 109 options to produce four partitions. So my method has four random bins. – ThomasIsCoding Sep 15 '20 at 18:17
  • My original `sample.int` function intends to randomly distribute 110 balls to four bins with equal probability. – Dong Sep 16 '20 at 15:40