I have 20 workers doing 100 tasks each. I have generated the true answer for each task, which is one of five possible answers, with:

answers <- c("liver", "blood", "lung", "brain", "heart")
truth <- sample(answers, no.tasks, replace = TRUE, prob = c(0.2, 0.2, 0.2, 0.2, 0.2))

My dataSet contains the columns workerID, taskID, and truth. Now I need to generate another vector that simulates what each worker will answer, based on a certain probability. For example, if the truth for task 1, worker 1 is "liver", I want worker 1 to answer "liver" for task 1 with a high probability. I want such a simulated answer for every one of the 2000 worker-task rows, whichever of the five answers is the truth. For that I am using the following for and if loops.

for (i in nrow(dataSet)){
if (dataSet$truth[i] == "liver")
{
df <- (rep(sample(answers, no.tasks, prob = c(0.9, 0.02, 0.02, 0.02, 0.02), no.workers)))
} else if (dataSet$truth[i] == "blood")
{ 
df <-  (rep(sample(answers, no.tasks, prob = c(0.02, 0.9, 0.02, 0.02, 0.02), no.workers)))
} else if (dataSet$truth[i] == "lung")
{
df <- (rep(sample(answers, no.tasks, prob = c(0.02, 0.02, 0.9, 0.02, 0.02), no.workers)))
} else if (dataSet$truth[i] == "brain")
{
df <- (rep(sample(answers, no.tasks, prob = c(0.02, 0.02, 0.02, 0.9, 0.02), no.workers)))
} else if (dataSet$truth[i] == "heart")
{
df <-  (rep(sample(answers, no.tasks, prob = c(0.02, 0.02, 0.02, 0.02, 0.9), no.workers)))
} else {
df <- (rep(sample(answers, no.tasks, prob = c(0.2, 0.2, 0.2, 0.2, 0.2), no.workers)))
}
}

But, since my truth for task 1 is brain, the output vector df has a lot of answers which are "brain". Can someone please give me a hint as to what is going wrong here?

amrapaliz
  • I haven't tried running your code yet, but looking at it, it doesn't look like you are actually storing your result each round, but are instead overwriting `df` every time. Try adding a statement at the top `df <- matrix(nrow = nrow(dataSet), ncol = no.tasks)` and make your assignments `df[i, ] <- ...` – Barker Sep 30 '16 at 00:17
  • Please show expected output. Only one vector? One vector per answer per task? – Parfait Sep 30 '16 at 01:28
  • @Parfait yes I want only one vector as the output – amrapaliz Sep 30 '16 at 03:23
  • And what should that vector look like given example data? This helps us reproduce. – Parfait Sep 30 '16 at 03:40
  • @Barker I did that but its giving me NA as the values :/. – amrapaliz Sep 30 '16 at 03:45
  • @Parfait just a vector of 2000 (20 workers*100 tasks) values with either one of the five answers > df [1] "liver" "heart" "blood" "lung" "lung" "lung" "liver" "blood" "lung" "blood" "heart" [12] "blood" "blood" "lung" "liver" "brain" "brain" "lung" "liver" "lung" "lung" "blood" [23] "liver" "lung" "heart" "heart" "blood" "liver" "lung" "brain" "brain" "blood" "blood" .... – amrapaliz Sep 30 '16 at 04:16
  • OK, so I changed two things: 1. I added `df <- vector(mode="character", length=2000)` and 2. `for (i in 1:nrow(dataSet))`, the `1:` was missing. When I run the loop, I get the vector that I want, but then I get this warning: `In df[i] <- (rep(sample(answers, no.tasks, prob = c(0.02, ... : number of items to replace is not a multiple of replacement length`. But this is OK to ignore, right? Because I am replacing each value in the same vector? – amrapaliz Sep 30 '16 at 04:44
  • Also, can I do this without a loop? – amrapaliz Sep 30 '16 at 05:33
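
Two of the points raised in the comments above can be checked in isolation. The following is a minimal sketch with made-up toy values (the names `visitedOnce`, `visitedAll`, `out` and `draw` are hypothetical, not from the question): it shows that `for (i in n)` runs a single iteration, and that assigning a longer vector into a single slot stores only its first element, which is what the "number of items to replace" warning is reporting.

```r
# 1) for (i in n) visits only the single value n, not 1..n
n <- 5
visitedOnce <- c()
for (i in n) visitedOnce <- c(visitedOnce, i)          # one iteration, i = 5

visitedAll <- c()
for (i in seq_len(n)) visitedAll <- c(visitedAll, i)   # five iterations, i = 1..5

# 2) assigning a longer vector into one slot keeps only its first element
out  <- character(3)
draw <- c("liver", "blood", "lung")
suppressWarnings(out[1] <- draw)   # R warns here; only the first value is stored
```

Applied to the loop in the question, this suggests that each iteration draws `no.tasks` values and keeps only the first one. Requesting a single draw, for example `sample(answers, 1, prob = ...)`, would probably avoid the warning altogether.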

1 Answer

Consider initializing a list with one element per row of dataSet (2,000 here). Each element then holds the character vector returned by sample(), which has no.tasks elements.

df <- vector("list", 2000) 

for (i in 1:nrow(dataSet)){
if (dataSet$truth[i] == "liver")
{
df[[i]] <-(rep(sample(answers, no.tasks, prob = c(0.9, 0.02, 0.02, 0.02, 0.02), no.workers)))
} else if (dataSet$truth[i] == "blood")
{ 
df[[i]] <-(rep(sample(answers, no.tasks, prob = c(0.02, 0.9, 0.02, 0.02, 0.02), no.workers)))
} else if (dataSet$truth[i] == "lung")
{
df[[i]] <-(rep(sample(answers, no.tasks, prob = c(0.02, 0.02, 0.9, 0.02, 0.02), no.workers)))
} else if (dataSet$truth[i] == "brain")
{
df[[i]] <-(rep(sample(answers, no.tasks, prob = c(0.02, 0.02, 0.02, 0.9, 0.02), no.workers)))
} else if (dataSet$truth[i] == "heart")
{
df[[i]] <-(rep(sample(answers, no.tasks, prob = c(0.02, 0.02, 0.02, 0.02, 0.9), no.workers)))
} 
}

Alternatively, you can use lapply(), which returns a list of the same length as its input (here, one element per row of dataSet) and so requires no initialization:

df2 <- lapply(seq_len(nrow(dataSet)), function(i){
  if (dataSet$truth[i] == "liver")
  {
  temp <- (rep(sample(answers, no.tasks, prob = c(0.9, 0.02, 0.02, 0.02, 0.02), no.workers)))
  } else if (dataSet$truth[i] == "blood")
  { 
  temp <- (rep(sample(answers, no.tasks, prob = c(0.02, 0.9, 0.02, 0.02, 0.02), no.workers)))
  } else if (dataSet$truth[i] == "lung")
  {
  temp <- (rep(sample(answers, no.tasks, prob = c(0.02, 0.02, 0.9, 0.02, 0.02), no.workers)))
  } else if (dataSet$truth[i] == "brain")
  {
  temp <- (rep(sample(answers, no.tasks, prob = c(0.02, 0.02, 0.02, 0.9, 0.02), no.workers)))
  } else if (dataSet$truth[i] == "heart")
  {
  temp <- (rep(sample(answers, no.tasks, prob = c(0.02, 0.02, 0.02, 0.02, 0.9), no.workers)))
  } 
  return(temp)
})

Even better, you can replace the chain of if/else statements by matching the current dataSet$truth against the answers vector and then setting the corresponding entry of the probability vector to 0.9:

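df3 <- lapply(seq_len(nrow(dataSet)), function(i){
  # start from the baseline weight for every answer
  probs <- c(0.02, 0.02, 0.02, 0.02, 0.02)
  # raise the weight of the true answer for this row
  probs[match(dataSet$truth[i], answers)] <- 0.9

  # draw no.tasks answers with replacement (the positional no.workers
  # argument used above is matched to sample()'s `replace` argument)
  sample(answers, no.tasks, replace = TRUE, prob = probs)
})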
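
The versions above return a list holding one vector of no.tasks draws per row. If what is wanted is a single simulated answer per row, as the comments under the question suggest, one possible loop-free sketch follows. The setup of `dataSet`, the seed, and the names `p.correct`, `truthIdx`, `isCorrect`, `shift` and `wrongIdx` are assumptions for illustration, since the question does not show how `dataSet` was built. Note also that this sketch gives the true answer with probability exactly 0.9 and splits the remaining 0.1 evenly over the other four answers, which differs slightly from the 0.9/0.02 weights above (sample() rescales those weights so that they sum to one).

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Assumed setup, mirroring the question: 20 workers x 100 tasks
no.workers <- 20
no.tasks   <- 100
answers    <- c("liver", "blood", "lung", "brain", "heart")
truth      <- sample(answers, no.tasks, replace = TRUE)

dataSet <- data.frame(
  workerID = rep(seq_len(no.workers), each = no.tasks),
  taskID   = rep(seq_len(no.tasks), times = no.workers),
  truth    = rep(truth, times = no.workers),
  stringsAsFactors = FALSE
)

p.correct <- 0.9            # hypothetical accuracy parameter
n <- nrow(dataSet)
k <- length(answers)

# Position of each row's true answer within `answers`
truthIdx  <- match(dataSet$truth, answers)
# TRUE where the simulated worker answers correctly
isCorrect <- runif(n) < p.correct
# For wrong answers, shift the true index by 1..(k - 1) positions
# (wrapping around) so the result can never equal the truth
shift     <- sample(seq_len(k - 1), n, replace = TRUE)
wrongIdx  <- (truthIdx - 1 + shift) %% k + 1

dataSet$answer <- answers[ifelse(isCorrect, truthIdx, wrongIdx)]
```

Because every step works on whole vectors, no explicit loop or lapply() call is needed, and the result is a single character column aligned row by row with `truth`.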
Parfait
  • Yes, thank you the lapply function is exactly what I wanted. That works well and gets rid of the loop, which is perfect because I will be working with larger data. – amrapaliz Sep 30 '16 at 23:34
  • Great! Please accept if answer helped and confirms resolution. Also, `lapply()` is technically still a loop but a vectorized one and provides more clarity. See: http://stackoverflow.com/questions/2275896/is-rs-apply-family-more-than-syntactic-sugar – Parfait Oct 01 '16 at 00:34
  • question: After I get the answers, I want to compare it with the answers from the dataSet to calculate the inter-rater agreement i.e. the kappa value. But, when I run this program a 100 times, I get some of the irr's to be negative. Do you have a clue as to why they would be negative? – amrapaliz Oct 13 '16 at 20:37
  • That might need to be a new question as I am not aware of your *irr* process. – Parfait Oct 14 '16 at 03:04
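
On the kappa question in the comments: the thread does not say how the inter-rater agreement was computed, so the following is only a sketch of Cohen's kappa for two raters, written in base R under that assumption. The function name `cohen_kappa` and the toy vectors are hypothetical. Kappa is (observed agreement - chance agreement) / (1 - chance agreement), so it becomes negative whenever observed agreement falls below chance. One way that could happen is when the two vectors being compared are not aligned item by item, which the shuffled vector below imitates.

```r
# Cohen's kappa for two raters who rated the same items (base R only)
cohen_kappa <- function(r1, r2) {
  lev <- union(r1, r2)
  tab <- table(factor(r1, levels = lev), factor(r2, levels = lev))
  n   <- sum(tab)
  p.obs <- sum(diag(tab)) / n                        # observed agreement
  p.exp <- sum(rowSums(tab) * colSums(tab)) / n^2    # agreement expected by chance
  (p.obs - p.exp) / (1 - p.exp)
}

set.seed(7)  # arbitrary seed, only for reproducibility
answers <- c("liver", "blood", "lung", "brain", "heart")
truth   <- sample(answers, 200, replace = TRUE)

# Simulated answers that mostly follow the truth, item by item
aligned  <- ifelse(runif(200) < 0.9, truth, sample(answers, 200, replace = TRUE))
# The same answers in random order, so they no longer line up with `truth`
shuffled <- sample(aligned)

kappaAligned  <- cohen_kappa(truth, aligned)    # high agreement
kappaShuffled <- cohen_kappa(truth, shuffled)   # near zero, can dip below zero
```

If the simulated answers really follow the truth with high probability, kappa should be clearly positive, so values around or below zero would be a reason to check how the two vectors are paired before they are compared.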