
Based on this post, I'm trying to draw a sample of rows, using the same R iris data example. I've successfully created a sample of 15 rows for each species:

Selec_ir<-iris[ with(iris, unlist(tapply(seq_len(nrow(iris)),
                          Species, FUN = sample, 15,replace=FALSE))), ]

But how can I draw the sample under the condition that each newly selected row must be at least 20 rows after the previously selected one?

Neil Lunn
freestyle

1 Answer


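The following function takes the number of rows in a group (passed as n() in the examples below), draws a sorted sample of row numbers without replacement, and then drops every drawn value that falls within the step size of the previously retained one, using a combination of split and findInterval. The returned vector is used to slice out the retained rows, so consecutive retained rows are at least sample_step apart. Because values are dropped after the draw, the result usually contains fewer rows than sample_size.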

Modify sample_size and sample_step as needed to adjust the initial sample size and the minimum number of rows between retained samples.

library(plyr)

sample_drop <- function(x, sample_size, sample_step=1) {

  # draw sample and convert to list
  lst_samp <- list(sort(sample(x, size=sample_size, replace=FALSE)))

  # function to split last element of list by step size
  split_last <- function(lst, step) {
    lst_tail <- unlist(tail(lst, n=1L))
    split(lst_tail, findInterval(lst_tail, c(0, step) + min(lst_tail)))
  }

  # keep splitting until the last element spans less than one step
  while(diff(range(unlist(tail(lst_samp, n=1L)))) >= sample_step) {
    lst_samp <- c(head(lst_samp, n=-1L), split_last(lst_samp, sample_step))
  }

  #lst_samp <- llply(lst_samp, unname) # for debug only to remove attr names
  laply(lst_samp, min) # return minimum value from each element

}
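To make the split step less abstract, here is a minimal sketch of a single pass of the findInterval/split combination on a small vector. The values in `drawn` and the `step` of 20 are hypothetical and chosen only for illustration.

```r
# Hypothetical sorted draw of row numbers and a step of 20
drawn <- c(3, 9, 17, 24, 31, 48)
step  <- 20

# Break points: the smallest drawn value, and that value plus the step
breaks <- c(0, step) + min(drawn)

# findInterval() labels values in [min, min + step) as 1
# and values at or beyond min + step as 2
bucket <- findInterval(drawn, breaks)

# split() turns those labels into two pieces of the original vector
pieces <- split(drawn, bucket)
print(pieces)
```

The function repeats this on the last piece for as long as that piece spans at least one step, and then keeps the minimum of each piece.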

Here is the function applied to the iris dataset.

library(dplyr)

data("iris")

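# settings list; naming it `sample` still works because R skips non-function
# objects when resolving the call sample(), but another name would be clearer
sample <- list()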
sample$seed <- 1
sample$size <- 15L
sample$step <- 20L

# simulate sample draws with dropping and compare to iris results
set.seed(sample$seed)
sample_drop(50, sample$size, sample$step)
sample_drop(50, sample$size, sample$step)
sample_drop(50, sample$size, sample$step)

set.seed(sample$seed)
iris %>%
  group_by(Species) %>%
  mutate(gid=row_number()) %>%
  slice(sample_drop(n(), sample$size, sample$step))

Here is the function applied to the larger diamonds dataset (shipped with ggplot2):

library(dplyr)
library(ggplot2)

data("diamonds")

sample <- list()
sample$seed <- 1
sample$size <- 1000L
sample$step <- 20L

set.seed(sample$seed)
diamonds %>%
  group_by(cut) %>%
  mutate(gid=row_number()) %>%
  slice(sample_drop(n(), sample$size, sample$step))

set.seed(sample$seed)
diamonds %>%
  group_by(cut) %>%
  mutate(gid=row_number()) %>%
  slice(sample_drop(n(), sample$size, sample$step)) %>%
  summarise(samples=n())

There is likely room for improvement, but this is a lot easier for me to follow.
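As one possible simplification, the same idea (sort the draw, then keep a value only if it is at least one step past the last kept value) can be sketched in base R with a plain loop. This is an alternative sketch, not the function above: the name `sample_spaced` is hypothetical, and capping `size` at `n` is an added assumption to avoid sampling more rows than a group has.

```r
# Base R sketch of the "draw, sort, drop values closer than step" idea.
# n: number of rows in the group; size: rows to draw; step: minimum gap.
sample_spaced <- function(n, size, step = 1L) {
  size  <- min(size, n)                 # assumption: never request more rows than exist
  drawn <- sort(sample.int(n, size))    # sorted draw without replacement
  keep  <- logical(length(drawn))
  last  <- -Inf                         # so the first drawn value is always kept
  for (i in seq_along(drawn)) {
    if (drawn[i] - last >= step) {      # far enough from the last kept value
      keep[i] <- TRUE
      last <- drawn[i]
    }
  }
  drawn[keep]
}

set.seed(1)
idx <- sample_spaced(50L, 15L, 20L)

# Applied per species to iris without dplyr, in the spirit of the question's code
rows <- unlist(lapply(split(seq_len(nrow(iris)), iris$Species),
                      function(r) r[sample_spaced(length(r), 15L, 20L)]))
Selec_ir <- iris[rows, ]
```

With 50 rows per species and a step of 20, at most three rows per species can survive the spacing rule, whatever the initial sample size.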

manotheshark
  • It's exactly what I'm looking for, but I'm an R beginner and I've never used the dplyr library. How can I adapt it to two different datasets: one where I must randomly select 15 rows for each species (I have 7 species instead of the three in iris), and one where I need 10 rows for each species (13 species)? When I change samp_size and samp_step in your code, I get this error: `Sample size (7) greater than population size (4). Do you want replace = TRUE?` – freestyle Dec 21 '16 at 11:00
  • @freestyle that error typically means that you are telling `sample` to draw more samples than the `length` of the data while `replace = FALSE`. If `replace` is set to `TRUE`, it can redraw from the data to fill up the specified sample length. Your comment says you need 10 rows for each species, so I'd look there first to make sure each group has enough rows and that the `group_by` command is set correctly. – manotheshark Dec 21 '16 at 13:30
  • @freestyle try the following command to verify the number of rows per group `iris %>% group_by(Species) %>% summarise(n())` – manotheshark Dec 21 '16 at 13:45
  • I have 43249 rows in my dataset. – freestyle Dec 22 '16 at 10:18
  • @freestyle I changed the approach to use a function. This should work for any dataset as it will reduce the sample size if there are not enough values to sample from. – manotheshark Dec 23 '16 at 16:59
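The situation discussed in the comments, asking for more rows than a group contains, can be reproduced in base R, where `sample` raises a comparable error. The sketch below uses a hypothetical group of 4 rows and a request for 7, matching the numbers in the quoted message, and shows one way to guard against it by capping the request at the group size.

```r
# Hypothetical group with only 4 rows, and a request for 7 of them
group_rows <- 1:4
wanted <- 7L

# Without replacement this fails; capture the error message instead of stopping
res <- tryCatch(sample(group_rows, wanted, replace = FALSE),
                error = function(e) conditionMessage(e))
print(res)

# Guard: never request more rows than the group contains
capped <- sample(group_rows, min(wanted, length(group_rows)))

# Base R check of rows per group, analogous to the group_by/summarise check above
print(table(iris$Species))
```

If any group is smaller than the requested sample size, either cap the request as above or lower the sample size for that dataset.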