
I have a dataset where each id has multiple samples and can be stratified by a group variable. I would like to do random sampling, stratified by group, without any id being repeated (i.e. each id appears only once in the output).

I have tried to modify some existing solutions; however, all of them end up including multiple samples from a single id across the groups.

I tried the following, thinking replace = FALSE might ensure that only one sample from each id is used, but it still does not do what I want.

library(dplyr)

set.seed(1)

# Data 
data <- data.frame(
  id = c("A", "C", "B", "D", "E", "F", "A", "A", "B", "B", "B", "D", "D", "E", "E", "F"),
  group = c("1", "1", "2", "2", "3", "3", "2", "1", "1", "2", "3", "2", "3", "2", "1", "3"),
  length = c("54", "52", "43", "42", "60", "46", "59", "60", "51", "45", "47", "58", "48", "46", "56", "57"))

# Stratified random sampling by group 
sample <- data %>%
  distinct() %>%
  group_by(group) %>%
  sample_n(2, replace = FALSE) %>%
  left_join(data)

sample output:

id group length
A   1   60      
C   1   52      
D   2   42      
A   2   59      
B   3   47      
E   3   60      

However, as seen above, id = A is repeated in groups 1 and 2. The ideal output should look something like this, where each id appears only once and samples are stratified by group:

id group length
A   1   54      
C   1   52      
B   2   43      
D   2   42      
E   3   60      
F   3   46

Is there a way to customise the existing solutions so that, when sampling for each group, an id that has already been used for another group is excluded and not sampled again? I know I could add %>% distinct(id) to my code, but I believe the result would no longer be random, since distinct() simply keeps the first row for each id. Thank you for any help!

smicaela
    If it is a small and known number of groups, could you iterate over the groups while keeping track of the ids already sampled, so that when you go to the next group you do setdiff() before sampling? – Yuan Yin Jul 20 '21 at 05:26

2 Answers


I have a candidate solution for you, using for-loops. Granted, the solution is a bit awkward and has some caveats related to your provided data. However, the script works as intended.

# Split by group; this provides a list
# with one data.frame per group.
data_list <- data %>% split(f = .$group)

# Shuffle the list to introduce randomness
shuffle <- sample(length(data_list))
data_list <- data_list[shuffle]

# Sample from the first element, which serves
# as a baseline for the remaining samples
sampled_data <- data_list[[1]] %>%
  distinct(id, .keep_all = TRUE) %>%
  sample_n(2)

for (i in 2:length(data_list)) {
  # Proceed to the next group
  new_data <- data_list[[i]]

  # Flag ids that have already been sampled
  indicator <- new_data$id %in% sampled_data$id

  # Drop those ids, then sample from what remains
  sampled_data <- bind_rows(
    sampled_data,
    new_data[!indicator, ] %>%
      distinct(id, .keep_all = TRUE) %>%
      group_by(group) %>%
      sample_n(2)
  )
}

With the data that you provided, this algorithm works only if the initial sampled_data happens to contain the right ids; otherwise the pool of unique ids is depleted before the later groups are reached.
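That depletion caveat can be softened by drawing at most as many rows as are actually available; a minimal base-R sketch (the `safe_sample` helper and its `min()` guard are my additions, not part of the code above):

```r
# Sample up to n rows, but never more than the data.frame holds,
# so the draw cannot error out when unique ids run low.
safe_sample <- function(df, n = 2) {
  k <- min(n, nrow(df))                       # shrink the draw if needed
  df[sample(nrow(df), k), , drop = FALSE]
}

pool <- data.frame(id = c("E", "F"), group = "3",
                   length = c("60", "46"))
nrow(safe_sample(pool, 2))                    # → 2
nrow(safe_sample(pool[1, , drop = FALSE], 2)) # → 1
```

Swapping such a helper in for the bare `sample_n(2)` calls would let the loop finish even when a group has fewer than two unused ids, at the cost of an unbalanced design.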

The algorithm starts by splitting your data into the respective groups using split, then shuffles the order of the resulting list so that the order in which groups are processed (and hence which rows distinct keeps) is random.

Initial Sampling

We start by taking a sample from the first group, which then serves as a baseline for the remaining groups.

Each iteration removes from the next list element every id that is already present in the baseline sample, then samples from what remains and binds the result into one data.frame.

Next Sample

The combined data.frame now holds the first two groups with distinct ids, and the loop removes from the remaining element any id already present in that data.frame before sampling again.
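This exclusion step is essentially the setdiff() idea from the comments; a minimal base-R illustration (the `used` and `pool` names are mine, not part of the answer's code):

```r
# Ids already drawn for earlier groups
used <- c("A", "C")

# Candidate rows for the next group
pool <- data.frame(id = c("A", "B", "D"), group = "2",
                   length = c("59", "43", "42"))

# Keep only rows whose id has not been used yet
avail <- pool[!pool$id %in% used, , drop = FALSE]
avail$id  # → "B" "D"
```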

The end product is the following:

id group length
1  B     1     51
2  C     1     52
3  D     2     42
4  A     2     59
5  E     3     60
6  F     3     46

The algorithm clearly needs some polishing if the data you provided are representative of your actual data, as, depending on the seed, the pool of unique ids can be depleted by the ids drawn for the earlier groups.

I did not provide a seed, as I had trouble finding a suitable one.
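Rather than hunting for a seed by hand, the search could be automated; a rough sketch (the `find_seed` helper is hypothetical, and `draw` stands in for whatever sampling routine is being rerun):

```r
# Try seeds 1..max_tries until draw() returns a data.frame
# whose id column has no duplicates.
find_seed <- function(draw, max_tries = 100) {
  for (s in seq_len(max_tries)) {
    set.seed(s)
    out <- draw()
    if (anyDuplicated(out$id) == 0) {
      return(list(seed = s, result = out))
    }
  }
  NULL  # no suitable seed found within max_tries
}
```

Wrapping the whole sampling pipeline in a function and passing it as `draw` would return the first seed for which every id is unique.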

Serkan
  • Thanks for the clear answer! The solution I used in the end had similar logic to your answer in that I first introduced randomisation. This was by `set.seed()`, then randomising all rows using `data2 <- data[sample(nrow(data)),]`. Since the rows were now randomised, I could use `distinct()` without worrying about it picking only the first row for each `id`. Then I `set.seed()` again and used `data2` to run the remaining code. The `seed` needed to be played around with so that the number of unique `id`s would be maximised across the groups. – smicaela Jul 21 '21 at 12:27
  • 1
    Quite a clever modification! :-) It was a tough nut to crack, honestly. – Serkan Jul 21 '21 at 12:30
  • 1
    Can you accept the answer, so we know it's closed then? Or do you expect more answers? :-) – Serkan Jul 22 '21 at 11:15
  • I have added the solution I used in the end as the accepted answer. Thank you again for your solution, will be helpful when a looped version is required. – smicaela Jul 23 '21 at 08:44
  • Nah - with `dplyr` and `group_by()` you'll hardly need a `for-loop`! – Serkan Jul 23 '21 at 08:45

This is the solution I used in the end.

library(dplyr)

# Randomise rows
set.seed(x) # play around and set seed accordingly
data_rows <- sample(nrow(data))
data2 <- data[data_rows, ]

# Stratified random sampling 
set.seed(x) # play around and set seed accordingly
randomised <- data2 %>%
  distinct(id, .keep_all = TRUE) %>%
  group_by(group) %>%
  sample_n(2, replace = FALSE) %>%
  ungroup()
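A quick sanity check (my addition, not part of the original answer) confirms the constraint holds; shown here on a hand-built copy of the expected output, though in practice you would run it on `randomised` itself:

```r
# Hand-built stand-in for the sampled result
result <- data.frame(
  id     = c("A", "C", "B", "D", "E", "F"),
  group  = c("1", "1", "2", "2", "3", "3"),
  length = c("54", "52", "43", "42", "60", "46")
)

# Every id appears at most once...
anyDuplicated(result$id) == 0  # → TRUE

# ...and each stratum contributed exactly two rows
all(table(result$group) == 2)  # → TRUE
```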
smicaela