My question is about the internals of data.table, I guess:
Why does the sample function treat the column as a vector of length > 1, whereas the pmin function works with the columns as if they were row-wise variables?
I hope this example code clarifies my question:
library(data.table)
dt <- data.table(probs = runif(1000000), probs2 = runif(1000000))
dt[, hit := sample(c(0,1), 1, prob = c(1 - probs, probs))]
# Error in sample.int(length(x), size, replace, prob) :
# incorrect number of probabilities
dt[, min_prob := pmin(probs, probs2)] # working as expected
dt[, hit := sample(c(0,1), 1, prob = c(1 - probs, probs)), by=1:nrow(dt)] # working
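To make the difference concrete, here is a minimal sketch that only checks the lengths each expression produces inside j. The 5-row table, the seed, and the variable names n_weights, n_pmin and n_per_row are chosen purely for illustration and are not part of my real data:

```r
library(data.table)

set.seed(42)
dt <- data.table(probs = runif(5), probs2 = runif(5))

# Without `by`, j is evaluated once and every column is a full-length vector,
# so c(1 - probs, probs) holds 2 * nrow(dt) weights instead of the 2 that
# sample(c(0, 1), ...) needs.
n_weights <- dt[, length(c(1 - probs, probs))]
n_weights  # 10

# pmin() is vectorized: it compares element by element and returns one value per row.
n_pmin <- dt[, length(pmin(probs, probs2))]
n_pmin  # 5

# With by = 1:nrow(dt), j runs once per row, so probs has length 1 in each group.
n_per_row <- dt[, length(c(1 - probs, probs)), by = 1:nrow(dt)]$V1
n_per_row  # 2 2 2 2 2
```

So the two functions see the same kind of input; the difference appears to be that pmin accepts whole vectors while the prob argument of sample must match the length of x.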
----------------------- additional -------------------------------------
Comparison of the accepted answer and the method using by=1:nrow(dt):
library(data.table)
dt <- data.table(probs = runif(1000000))
set.seed(1234)
system.time(dt[, hit := sapply(probs, function(x) sample(0:1, 1, prob=c(1 - x, x)))])
set.seed(1234)
system.time(dt[, hit2 := sample(c(0,1), 1, prob = c(1 - probs, probs)), by=1:nrow(dt)])
all.equal(dt$hit, dt$hit2)
# TRUE
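As a side note that is not part of the comparison above: since each hit is a single 0/1 draw with its own probability, a fully vectorized draw with base R's rbinom() might avoid the per-row calls altogether. This is only a sketch under that assumption; the column name hit3 and the smaller table are hypothetical, and the result is not expected to reproduce hit or hit2 for the same seed, because the random numbers are generated differently:

```r
library(data.table)

set.seed(1234)
dt <- data.table(probs = runif(1000))  # smaller table, for illustration only

# rbinom() is vectorized over prob: one 0/1 draw per row, each with its own probability.
# hit3 is a hypothetical column name used only in this sketch.
dt[, hit3 := rbinom(.N, size = 1, prob = probs)]

# Rough sanity check: the share of hits should be close to the mean probability.
dt[, .(share_hits = mean(hit3), mean_prob = mean(probs))]
```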