
I have a data frame with probabilities for three outcomes: A, B and C. Their probabilities are prob1, prob2, and prob3:

df = data.frame(prob1=runif(1000,0,0.2),prob2=runif(1000,0,0.1))
df$prob3 = 1-df$prob1-df$prob2

I am trying to simulate an outcome for each row given its unique probabilities and run the following loop:

df$outcome = NA
for (i in 1:1000) {
   df$outcome[i] <- sample(c("A","B","C"), 1, prob = c(df$prob1[i],df$prob2[i],df$prob3[i]), replace = FALSE)  # labels must be quoted strings
}

I have a large data set and would like to avoid loops. How can I do that?

Ben Bolker
newbie2019

2 Answers


Here's one way via multinomial sampling:

m <- t(apply(df,1,rmultinom,n=1,size=1))  ## 1000 x 3 matrix of 0/1 values (df must hold only the three probability columns)
w <- apply(m==1,1,which)                  ## vector of 1000 values in {1,2,3}; which() needs a logical input

If you want labels you could follow this with c("A","B","C")[w].
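As a self-contained sketch of the steps above, the following rebuilds the example data, draws one multinomial sample per row, and maps the result to labels. The seed is an assumption added only for reproducibility, and max.col is swapped in here as an alternative to apply(..., which) for locating the single 1 in each row.

```r
set.seed(1)  # assumed seed, only so the run is reproducible

# Example data from the question: three probabilities per row, summing to 1
df <- data.frame(prob1 = runif(1000, 0, 0.2), prob2 = runif(1000, 0, 0.1))
df$prob3 <- 1 - df$prob1 - df$prob2

# One multinomial draw of size 1 per row; each row of df is passed as `prob`
m <- t(apply(df, 1, rmultinom, n = 1, size = 1))  # 1000 x 3 matrix of 0/1 values

# Column index of the single 1 in each row, then the matching label
w <- max.col(m)
outcome <- c("A", "B", "C")[w]

table(outcome)  # rough sanity check: "C" should dominate
```

Since prob3 is at least 0.7 in every row, most draws should come out as "C".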

If you want to go beyond base R, the Hmisc package has rMultinom:

library(Hmisc)
colnames(df) <- c("A","B","C")
w <- rMultinom(df, m=1)

I modified the column names because rMultinom automatically uses the column names as the values of the samples.

If you need really fast vectorized multinomial sampling and you're willing to deal with the hassle of compiled code, the answers to this question can help.
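Before reaching for compiled code, one base R option that avoids both loops and apply is inverse-CDF sampling: draw a single uniform number per row and compare it with the row's cumulative probabilities. This is a minimal sketch rather than part of the original answer; it assumes the columns prob1, prob2 and prob3 from the question, and the seed is only there for reproducibility.

```r
set.seed(42)  # assumed seed, only so the run is reproducible

# Example data from the question
df <- data.frame(prob1 = runif(1000, 0, 0.2), prob2 = runif(1000, 0, 0.1))
df$prob3 <- 1 - df$prob1 - df$prob2

# One uniform draw per row
u <- runif(nrow(df))

# Index is 1 if u <= prob1, 2 if prob1 < u <= prob1 + prob2, and 3 otherwise,
# so each outcome is picked with exactly its row-specific probability
idx <- 1L + (u > df$prob1) + (u > df$prob1 + df$prob2)

df$outcome <- c("A", "B", "C")[idx]
```

Every operation here works on whole columns, so the cost grows linearly with the number of rows without any per-row function calls.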

Ben Bolker

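You can use apply:

## restrict to the probability columns so an existing outcome column is not passed as prob
df$outcome <- apply(df[c("prob1", "prob2", "prob3")], 1, function(x) sample(c("A", "B", "C"), 1, prob = x))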

Or using dplyr rowwise:

library(dplyr)

df %>%
  rowwise() %>%
  mutate(outcome = sample(c("A", "B", "C"), 1, prob = c_across(prob1:prob3))) %>%
  ungroup()
Ronak Shah