I have a question about random sampling in R. I have two datasets. In the first, df1, each row is a sample, and the location where the sample was collected is stored in the variable "loc", which is a character column. An example layout is shown below.

    ID loc x1 x2 x3 
    1  A   x  x  x
    2  A   x  x  x
    3  A   x  x  x
    4  B   x  x  x
    5  B   x  x  x 
    6  C   x  x  x 
    7  C   x  x  x 
    8  C   x  x  x
    9  C   x  x  x
    etc.

The second dataset, say df2, is a list of all of the locations and the number of random samples required from each location. It looks like this:

    loc n
    A   2
    B   1
    C   3

How can I draw a different number of random samples from each group, where the number required for each location is given in df2?

geoscience123

1 Answer

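We can split the first dataset by 'loc' with group_split, then use map2 to loop over the resulting list together with the corresponding 'n' from the second dataset, and pass each pair to sample_n. This assumes the rows of df2 are in the same order as the groups returned by group_split (ordered by 'loc') and that both datasets contain the same locations.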

    library(purrr)
    library(dplyr)
    map2_dfr(df1 %>% group_split(loc), df2$n, ~ sample_n(.x, .y))
    # A tibble: 6 x 5
    #     ID loc   x1    x2    x3   
    #  <int> <chr> <chr> <chr> <chr>
    #1     1 A     x     x     x    
    #2     2 A     x     x     x    
    #3     5 B     x     x     x    
    #4     6 C     x     x     x    
    #5     8 C     x     x     x    
    #6     7 C     x     x     x    

Another option is to group by 'loc' and use match to look up each group's 'n' in the second dataset

    df1 %>% 
      group_by(loc) %>%
      sample_n(df2$n[match(first(loc), df2$loc)])
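
If a requested 'n' might exceed the number of rows available at a location, sample_n stops with an error unless replace = TRUE. A minimal sketch of one way around that, assuming it is acceptable to return every row of a location that has fewer rows than requested: join the counts onto df1, give each row a random rank within its group, and keep the ranks up to the requested count. The column names `n_required` and `rand_rank` are hypothetical helpers, not part of the original data.

```r
library(dplyr)

# Example data, same values as in the "data" section of this answer.
df1 <- data.frame(
  ID  = 1:9,
  loc = c("A", "A", "A", "B", "B", "C", "C", "C", "C"),
  x1  = "x", x2 = "x", x3 = "x",
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  loc = c("A", "B", "C"),
  n   = c(2L, 1L, 3L),
  stringsAsFactors = FALSE
)

set.seed(1)  # only so the draw is reproducible

sampled <- df1 %>%
  # Attach the requested count to every row (n_required is a hypothetical name).
  # inner_join drops locations that have no entry in df2.
  inner_join(rename(df2, n_required = n), by = "loc") %>%
  group_by(loc) %>%
  # Random permutation of 1..group size, i.e. a random rank per row.
  mutate(rand_rank = sample(n())) %>%
  # Keep the first n_required ranks; smaller groups are returned whole.
  filter(rand_rank <= n_required) %>%
  ungroup() %>%
  select(-rand_rank, -n_required)

sampled
```

Because the lookup is done by joining on 'loc', this version does not depend on the row order of df2.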

data

    df1 <- structure(list(ID = 1:9,
        loc = c("A", "A", "A", "B", "B", "C", "C", "C", "C"),
        x1 = c("x", "x", "x", "x", "x", "x", "x", "x", "x"),
        x2 = c("x", "x", "x", "x", "x", "x", "x", "x", "x"),
        x3 = c("x", "x", "x", "x", "x", "x", "x", "x", "x")),
        class = "data.frame", row.names = c(NA, -9L))

    df2 <- structure(list(loc = c("A", "B", "C"), n = c(2L, 1L, 3L)),
        class = "data.frame", row.names = c(NA, -3L))

akrun
  • I attempted this, but I get the error: Mapped vectors must have consistent lengths: .x has length 28, .y has length 29. – geoscience123 Jan 06 '20 at 18:22
  • @coconn41 Based on the data you showed, I am not getting any errors. Please check the `data` in my post – akrun Jan 06 '20 at 18:24
  • My apologies, there was an extra observation. I now corrected it. Instead, I get an error 'size' must be less or equal than 3 (size of data), set 'replace' = TRUE to use sampling with replacement. However, the match option works well for me! – geoscience123 Jan 06 '20 at 18:30
  • @coconn41 can you please update your post with a new example – akrun Jan 06 '20 at 18:31