Select random sample by group, with additional condition in R

Question

Based on this post, I'm trying to draw a sample of rows, using the same R iris data example. I've successfully created a sample of 15 rows for each species:

Selec_ir<-iris[ with(iris, unlist(tapply(seq_len(nrow(iris)),
                          Species, FUN = sample, 15,replace=FALSE))), ]

But how can I now create a sample under the condition that each newly selected row must be at least 20 rows after the last selected one?

Your question is a little unclear; please provide an example of your desired output to illustrate. — nrussell, Dec 19 '16 at 13:09
how are you going to draw 15 samples that are at least 20 rows after the previous sample when iris only has 150 rows? — manotheshark, Dec 19 '16 at 13:15
@manotheshark, sorry we can randomly select only 2 rows instead of 15; — freestyle, Dec 19 '16 at 13:24
`Selec_ir<-iris[ with(iris, unlist(tapply(seq_len(nrow(iris)), Species, FUN = sample, 3,replace=FALSE))), ]` — freestyle, Dec 19 '16 at 13:26
@manotheshark, the idea is that if one row is selected, the next selected one must be at least at the 20th position from the last selected one. — freestyle, Dec 19 '16 at 13:29
Do we still have any restrictions on how many records of each species there should be? — Iaroslav Domin, Dec 19 '16 at 13:39
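
The feasibility concern raised in these comments can be made concrete with a little arithmetic. As a rough sketch (the helper name `max_spaced_picks` is made up for illustration and does not appear in the original post), the largest number of rows that can be kept from `n` rows when consecutive picks must be at least `step` rows apart is `(n - 1) %/% step + 1`:

```r
# Hypothetical helper: upper bound on picks spaced at least `step` rows apart
max_spaced_picks <- function(n, step) {
  stopifnot(n >= 1, step >= 1)
  (n - 1) %/% step + 1
}

per_species <- max_spaced_picks(50, 20)   # one iris species has 50 rows
whole_table <- max_spaced_picks(150, 20)  # all 150 iris rows together
print(c(per_species = per_species, whole_table = whole_table))
```

With 50 rows per species and a step of 20, at most 3 rows per species can satisfy the constraint (8 across the whole table), so asking for 15 cannot work as stated.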

manotheshark · Answer 1 · 2016-12-26T20:44:09.927

The following function is given the rows of each group (the examples below pass the group size, n()), draws a sample of row numbers without replacement, and then drops every value that falls within the step size of a retained value, using a combination of split and findInterval. The returned vector is then used to slice out the retained rows, which are at least the desired step apart.

Modify sample_size and sample_step as needed to adjust the initial sample size and the minimum number of rows between retained samples. Because draws that land too close together are dropped, the result usually contains fewer rows than sample_size.

library(plyr)

sample_drop <- function(x, sample_size, sample_step=1) {

  # draw sample and convert to list
  lst_samp <- list(sort(sample(x, size=sample_size, replace=FALSE)))

  # function to split last element of list by step size
  split_last <- function(lst, step) {
    lst_tail <- unlist(tail(lst, n=1L))
    split(lst_tail, findInterval(lst_tail, c(0, step) + min(lst_tail)))
  }

  # split list until all values of last element fall within step size
  while (diff(range(unlist(tail(lst_samp, n=1L)))) >= sample_step) {
    lst_samp <- c(head(lst_samp, n=-1L), split_last(lst_samp, sample_step))
  }

  #lst_samp <- llply(lst_samp, unname) # for debug only to remove attr names
  laply(lst_samp, min) # return minimum value from each element

}
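
To see what the function returns, it may help to compare it with a plain loop. The following is only an illustrative sketch, not the answer's code: `thin_by_step` is a hypothetical name, and the `drawn` vector is made up to stand in for a random draw. It keeps the smallest index and then every later index that is at least `step` rows after the last one kept, which should give the same indices as the split/findInterval approach above.

```r
# Hypothetical helper (not from the answer): keep the smallest index, then
# each later index that is at least `step` rows after the last one kept.
thin_by_step <- function(idx, step) {
  idx <- sort(idx)
  kept <- idx[1]
  for (i in idx[-1]) {
    if (i - kept[length(kept)] >= step) {
      kept <- c(kept, i)
    }
  }
  kept
}

drawn <- c(3, 7, 19, 24, 30, 44, 48)  # pretend these were sampled from 1:50
kept <- thin_by_step(drawn, step = 20)
print(kept)  # 3 24 44
```

Here 7 and 19 are dropped for being within 20 rows of 3, 30 for being too close to 24, and 48 for being too close to 44.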

Here is the function applied to the iris dataset.

library(dplyr)

data("iris")

sample <- list()
sample$seed <- 1
sample$size <- 15L
sample$step <- 20L

# simulate sample draws with dropping and compare to iris results
set.seed(sample$seed)
sample_drop(50, sample$size, sample$step)
sample_drop(50, sample$size, sample$step)
sample_drop(50, sample$size, sample$step)

set.seed(sample$seed)
iris %>%
  group_by(Species) %>%
  mutate(gid=row_number()) %>%
  slice(sample_drop(n(), sample$size, sample$step))

Here is the function applied to the larger diamonds dataset:

library(dplyr)
library(ggplot2)

data("diamonds")

sample <- list()
sample$seed <- 1
sample$size <- 1000L
sample$step <- 20L

set.seed(sample$seed)
diamonds %>%
  group_by(cut) %>%
  mutate(gid=row_number()) %>%
  slice(sample_drop(n(), sample$size, sample$step))

set.seed(sample$seed)
diamonds %>%
  group_by(cut) %>%
  mutate(gid=row_number()) %>%
  slice(sample_drop(n(), sample$size, sample$step)) %>%
  summarise(samples=n())
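
A quick sanity check on either result is to confirm that consecutive retained row numbers within a group are at least `step` apart. A minimal base R sketch, using made-up `gid` vectors in place of real pipeline output (`gaps_ok` is a hypothetical helper name, not part of the answer):

```r
# Hypothetical check: TRUE when all sorted row numbers are >= `step` apart
gaps_ok <- function(gid, step) {
  gid <- sort(gid)
  length(gid) < 2 || all(diff(gid) >= step)
}

good <- gaps_ok(c(4, 25, 47), step = 20)  # gaps of 21 and 22
bad <- gaps_ok(c(4, 20, 47), step = 20)   # first gap is only 16
print(c(good = good, bad = bad))
```

In the dplyr pipelines above, the same helper could presumably be applied per group by ending the chain with something like `summarise(ok = gaps_ok(gid, sample$step))`.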

There is likely room for improvement, but this version is a lot easier for me to follow.

It's exactly what I'm looking for, but I'm an R beginner and I've never used the dplyr library. How can I adapt it to two different datasets, in which I must randomly select (1) 15 lines for each species (in this case I have 7 species instead of the three in iris) and (2) 10 lines for each species (13 species)? In your code, when I change samp_size and samp_step, I get this error: `Sample size (7) greater than population size (4). Do you want replace = TRUE?` — freestyle, Dec 21 '16 at 11:00
@freestyle that error typically means that you are telling `sample` to draw more samples than the `length` of the original data while `replace = FALSE`. If `replace` is set to `TRUE`, it can redraw from the data to fill up the specified sample length. Your comment says you have 10 lines for each Species, but I'd look there first to make sure you have sufficient rows and that the `group_by` command is set correctly. — manotheshark, Dec 21 '16 at 13:30
@freestyle try the following command to verify the number of rows per group `iris %>% group_by(Species) %>% summarise(n())` — manotheshark, Dec 21 '16 at 13:45
@freestyle I changed the approach to use a function. This should work for any dataset as it will reduce the sample size if there are not enough values to sample from. — manotheshark, Dec 23 '16 at 16:59
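
Base R's `sample()` raises a similar error when asked for more distinct values than the population holds, and capping the requested size at the group size is one way to avoid it. A minimal sketch, assuming a group of only 4 rows (`safe_sample` is a hypothetical wrapper, not something from the answer):

```r
n_rows <- 4L   # a small group
wanted <- 7L   # more rows than the group holds

# sample() refuses to draw 7 distinct values from 1:4; capture the message
err <- tryCatch(sample(n_rows, wanted), error = function(e) conditionMessage(e))
print(err)

# Hypothetical wrapper: never ask for more rows than exist
safe_sample <- function(n, size) sample(n, size = min(size, n), replace = FALSE)

set.seed(1)
picked <- safe_sample(n_rows, wanted)
print(sort(picked))  # all four row numbers, since the request was capped at 4
```

Whether silently returning fewer rows is acceptable depends on the analysis; checking group sizes first, as suggested in the comments, makes the shortfall visible.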
