
My question is related to the internals of data.table, I guess:

Why does the sample function treat the column as a whole vector of length > 1, whereas the pmin function works on the columns as if they were row-wise variables?

I hope this example code clarifies my question:

library(data.table)

dt <- data.table(probs = runif(1000000), probs2 = runif(1000000))

dt[, hit := sample(c(0,1), 1, prob = c(1 - probs, probs))]
# Error in sample.int(length(x), size, replace, prob) : 
# incorrect number of probabilities

dt[, min_prob := pmin(probs, probs2)] # working as expected

dt[, hit := sample(c(0,1), 1, prob = c(1 - probs, probs)), by=1:nrow(dt)] # working

----------------------- additional -------------------------------------

Comparison of accepted answer and method using by=1:nrow(dt)

library(data.table)

dt <- data.table(probs = runif(1000000))

set.seed(1234)
system.time(dt[, hit := sapply(probs, function(x) sample(0:1, 1, prob=c(1 - x, x)))])
set.seed(1234)
system.time(dt[, hit2 := sample(c(0,1), 1, prob = c(1 - probs, probs)), by=1:nrow(dt)])

all.equal(dt$hit, dt$hit2)
# TRUE
feinmann
  • Both sample() and pmin() are part of base R, not `data.table`. Why are they designed the way they are? I guess only their authors can answer. – s_baldur Aug 18 '20 at 10:46
  • `pmin` takes multiple vectors and returns the minimum at each position. In the first case you gave the `prob` parameter the values `1-probs` and `probs`, which together have length `2*nrow(dt)`. That is more than 2, the number of values to sample from. – det Aug 18 '20 at 10:56
  • Is it good practice to do `by=1:nrow(dt)` or is there a better solution? – feinmann Aug 18 '20 at 11:21

1 Answer


You are misusing the sample function. From the documentation of ?sample, the prob argument takes:

prob: a vector of probability weights for obtaining the elements of the vector being sampled.

Since you have two possible values c(0, 1), you need prob to be a vector of length 2.

But when you call prob = c(1 - probs, probs) inside your data.table call, the expression is evaluated once on the whole columns. It is equivalent to calling prob = c(1 - dt$probs, dt$probs), which is a vector of length 2000000, not of length 2 as you need.
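
The length mismatch can be reproduced in base R without data.table. This is a minimal sketch in which the plain vector probs stands in for the column (5 values instead of 1000000) and the names weights and res are only illustrative:

```r
set.seed(1)
probs <- runif(5)               # stands in for the `probs` column

# Whole-column evaluation: both pieces are concatenated end to end.
weights <- c(1 - probs, probs)
length(weights)                 # 2 * length(probs), not 2

# sample() expects one weight per candidate value (here: two), so the
# call fails; tryCatch() captures the error message instead of stopping.
res <- tryCatch(
  sample(c(0, 1), 1, prob = weights),
  error = function(e) conditionMessage(e)
)
res
```

pmin, by contrast, is designed to take several vectors of equal length and work element by element, which is why it looks "row-wise" in the same setting.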

A solution would be to use sapply:

library(data.table)
dt <- data.table(probs = runif(5), probs2 = runif(5))
dt[, hit := sapply(probs, function(x) sample(0:1, 1, prob=c(1 - x, x)))]
dt
#>        probs     probs2 hit
#> 1: 0.1196779 0.46539006   0
#> 2: 0.9896483 0.31307527   1
#> 3: 0.4169862 0.08778795   0
#> 4: 0.9456939 0.09123848   1
#> 5: 0.5033147 0.27397908   0
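
If the goal is just one 0/1 draw per row with success probability probs, a fully vectorised call may avoid the per-row loop altogether. The sketch below is an alternative to the sapply approach, not something taken from the question: it relies on rbinom() being vectorised over its prob argument, and the column name hit_u is made up for illustration. Because it consumes random numbers differently, it should not be expected to reproduce the sapply result for the same seed.

```r
library(data.table)

set.seed(42)
dt <- data.table(probs = runif(10))

# One Bernoulli draw per row: rbinom() recycles size and is vectorised
# over prob, so no by= or sapply() is needed.
dt[, hit := rbinom(.N, size = 1, prob = probs)]

# Same idea with an explicit uniform draw compared to each probability.
dt[, hit_u := as.integer(runif(.N) < probs)]

dt
```

On a large table this kind of call is likely to be much faster than looping over rows, though that is worth timing on your own data.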
Vincent
  • Thank you for your constructive answer. In case anyone is interested, I added a small benchmark to my question. It is not very informative, but one question remains: is `by=1:nrow(dt)` advisable? – feinmann Aug 18 '20 at 13:32
  • Using your `by=1:nrow(dt)` construction seems like a valid alternative. Here's a related question with a good answer that discusses some loop-based alternatives and includes benchmarks: https://stackoverflow.com/questions/37667335/row-operations-in-data-table-using-by-i – Vincent Aug 18 '20 at 13:59