21

I am trying to generate a random sample that excludes certain "bad data." I do not know whether the data is "bad" until after I sample it. Thus, I need to make a random draw from the population and then test it. If the data is "good" then keep it. If the data is "bad" then randomly draw another and test it. I would like to do this until my sample size reaches 25. Below is a simplified example of my attempt to write a function that does this. Can anyone please tell me what I am missing?

df <- data.frame(NAME=c(rep('Frank',10),rep('Mary',10)), SCORE=rnorm(20))
df

random.sample <- function(x) {
  x <- df[sample(nrow(df), 1), ]
  if (x$SCORE > 0) return(x)
 #if (x$SCORE <= 0) run the function again
}

random.sample(df)
user1491868

4 Answers

33

Here is a general use of a while loop:

random.sample <- function(df) {
  success <- FALSE
  while (!success) {
    # do something: draw one row at random
    i <- sample(nrow(df), 1)
    x <- df[i, ]
    # check for success
    success <- x$SCORE > 0
  }
  return(x)
}

An alternative is to use repeat (syntactic sugar for while(TRUE)) and break:

random.sample <- function(df) {
  repeat {
    # do something: draw one row at random
    i <- sample(nrow(df), 1)
    x <- df[i, ]
    # exit if the condition is met
    if (x$SCORE > 0) break
  }
  return(x)
}

where break makes you exit the repeat block. Alternatively, you could have if (x$SCORE > 0) return(x) to exit the function directly.
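
To reach the sample size of 25 from the question, the same loop can collect rows until enough good ones are kept. The following is only a sketch: `sample_until`, `is_good`, and `max_draws` are hypothetical names that do not appear in the thread, and it assumes that drawing the same row more than once is acceptable (unavoidable here, since the example data has only 20 rows).

```r
# Hypothetical helper: draw rows one at a time and keep only the "good" ones
# until n rows have been collected. `is_good` is any function that takes a
# one-row data frame and returns TRUE or FALSE.
sample_until <- function(df, n, is_good, max_draws = 10000L) {
  kept <- vector("list", n)
  n_kept <- 0L
  draws <- 0L
  while (n_kept < n) {
    # guard against looping forever when no row can pass the test
    if (draws >= max_draws) stop("gave up after ", max_draws, " draws")
    draws <- draws + 1L
    x <- df[sample(nrow(df), 1), , drop = FALSE]
    if (is_good(x)) {
      n_kept <- n_kept + 1L
      kept[[n_kept]] <- x
    }
  }
  do.call(rbind, kept)
}

set.seed(1)
df <- data.frame(NAME = c(rep("Frank", 10), rep("Mary", 10)), SCORE = rnorm(20))
good <- sample_until(df, 25, function(row) row$SCORE > 0)
```

The test is passed in as a function so that it can be anything that is only computable after the draw; `x$SCORE > 0` is just the stand-in used in the question.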

flodel
4

Use this after your first sample:

while (any(bad <- (x$SCORE <= 0)))
   x[bad, ] <- df[sample(nrow(df), sum(bad)), ]
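
As a rough end-to-end sketch of the loop above for a sample of 25: the first draw and `replace = TRUE` are assumptions added here (the example data has only 20 rows, so 25 rows cannot be drawn without replacement), and the seed is only there to make the run repeatable.

```r
set.seed(42)
df <- data.frame(NAME = c(rep("Frank", 10), rep("Mary", 10)), SCORE = rnorm(20))

# First sample: 25 rows drawn at once, with replacement.
x <- df[sample(nrow(df), 25, replace = TRUE), ]

# Redraw only the rows that failed the check, and repeat until none fail.
while (any(bad <- (x$SCORE <= 0))) {
  x[bad, ] <- df[sample(nrow(df), sum(bad), replace = TRUE), ]
}

# The row names still refer to the first draw, so reset them.
rownames(x) <- NULL
```

Note that this loop never ends if no row in `df` can pass the check.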
Ricardo Saporta
3
 random.sample <- function(x) {
   x <- df[sample(nrow(df), 1), ]
   if (x$SCORE > 0) return(x)
   Recall(x)  # run the function again
 }

 random.sample(df)
#   NAME    SCORE
#14 Mary 1.252566

It seems to me that this should work as well:

 df$SCORE[ df$SCORE > 0 ][ sample(1:sum(df$SCORE > 0), 1) ]
#[1] 0.6579631
IRTFM
  • VERY nice help. The Recall function is not even mentioned anywhere in all of my R manuals. Is it better if I use: if (x$SCORE > 0) { return(x) } else { Recall(x) }? – user1491868 Dec 10 '13 at 23:45
  • 1
    elegant but not as efficient as a `while` loop IMHO, as it can create a large call stack. – flodel Dec 11 '13 at 00:31
  • You are essentially doing rejection sampling. It could have been as simple as: `df$SCORE[df$SCORE > 0][sample(1:sum(df$SCORE > 0), 1)]`. I'm not sure how to advise on the checkmark. Mine was essentially a throw-away answer. Flodel is right about efficiency. Recursion is not well supported in R. – IRTFM Dec 11 '13 at 04:43
  • 1
    Regarding your `df$SCORE[df$SCORE > 0][...]`, it is the same thing I commented to Stephen: OP is giving a "*simplified example*" of a more complex situation where "*I do not know whether the data is "bad" until after I sample it*". So a recursion or a while loop are about the only possible solutions. – flodel Dec 11 '13 at 12:20
3

You can just select the rows to sample directly like so (just 5):

> df <- data.frame(NAME=c(rep('Frank',10),rep('Mary',10)), SCORE=rnorm(20))
> df[sample(which(df$SCORE>0), 5),]
    NAME     SCORE
14  Mary 1.0858854
10 Frank 0.7037989
16  Mary 0.7688913
5  Frank 0.2067499
17  Mary 0.4391216

This samples without replacement; for a bootstrap, add replace=TRUE.
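
One caveat with `sample(which(...), n)`: when `which()` returns a single index k, `sample()` treats it as `1:k` and may return a row that fails the test. A slightly more defensive sketch samples positions within the index vector instead. The variable names `ok`, `picked`, and `result` are illustrative, and sampling with replacement is an assumption made so that 25 rows can be drawn from 20.

```r
set.seed(7)
df <- data.frame(NAME = c(rep("Frank", 10), rep("Mary", 10)), SCORE = rnorm(20))

# Indices of the rows that pass the test.
ok <- which(df$SCORE > 0)
stopifnot(length(ok) > 0)  # nothing to sample from otherwise

# Sample positions within `ok`, so a single surviving index is handled correctly.
picked <- ok[sample.int(length(ok), 25, replace = TRUE)]
result <- df[picked, ]
```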

Stephen Henderson
  • 1
    I upvoted but since the OP said *I do not know whether the data is "bad" until after I sample it* I am not sure that it will work for him. His example might have been poorly chosen. – flodel Dec 11 '13 at 03:01
  • @flodel fair enough but R is not a realtime app, nor good at recursive function calls, so if the data needs checked the test is in the data and should be vectorised and put between the brackets.. like this. – Stephen Henderson Dec 11 '13 at 08:06
  • Whether I keep the observation is a function of the observation itself. I cannot determine whether to keep the observation until after it has been drawn. – user1491868 Dec 11 '13 at 14:06
  • @user1491868 if the obs is really in a DF then you can do exactly that, subset by your criteria then sample...anyway it's not really that important is it :) – Stephen Henderson Dec 11 '13 at 15:34
  • After thinking things through I decided to subset my data by the criteria before sampling. But I still think this thread is useful when it is impossible to subset the data before sampling it. Thank you to everybody for their very helpful comments and suggestions. – user1491868 Dec 11 '13 at 16:34