This question is a follow-up to my previous question, "Efficient recursive random sampling". The solutions in that thread work fine when the groups are of identical size or when a fixed number of samples per group is required. However, imagine a dataset as follows:
ID1 ID2
1 A 1
2 A 6
3 B 1
4 B 2
5 B 3
6 C 4
7 C 5
8 C 6
9 D 6
10 D 7
11 D 8
12 D 9
where we want to randomly sample up to n values of ID2 for each ID1, and to do so recursively. Recursively here means that we move from the first ID1 to the last ID1, and if an ID2 was already sampled for an earlier ID1, it must not be used for any subsequent ID1. With n = 2, one possible result would be as follows:
ID1 ID2
1 A 1
2 A 6
4 B 2
5 B 3
6 C 4
7 C 5
11 D 8
12 D 9
- For ID1 = "A", there are exactly two potential ID2, so both are selected.
- For ID1 = "B", there are two potential ID2 left to select, so both are selected.
- For ID1 = "C", there are two potential ID2 left to select, so both are selected.
- For ID1 = "D", there are three potential ID2 left to sample from, so two are randomly selected from those.
What can happen beyond the situation shown in the example:
- Every ID1 always has a non-zero number of ID2 available, however, it is possible that all of those ID2 were already used. In that case, ID1 should be simply left out.
- It is possible that no ID1 has the specified n of ID2 available. In that case, as many ID2 as are available (the number closest to the specified n) should be retrieved.
- ID doesn't have to be `seq(ID1)`.
- ID2 could also be a character vector, similar to ID1.
Sample df:
df <- structure(list(ID1 = c("A", "A", "B", "B", "B", "C", "C", "C",
"D", "D", "D", "D"), ID2 = c(1, 6, 1, 2, 3, 4, 5, 6, 6, 7, 8,
9)), class = "data.frame", row.names = c(NA, -12L))
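To make the intended behaviour concrete, here is a minimal base R sketch of the procedure described above, written as a plain loop. The function name `sample_recursive` and its arguments are hypothetical (they do not come from the linked thread), and it assumes that ID1 should be processed in order of first appearance and that a repeated ID2 within one ID1 counts only once. It is meant as a slow reference for the expected output, not as the efficient solution I am looking for.

```r
# Hypothetical helper: walk through ID1 in order of first appearance and
# sample up to n ID2 values per group that no earlier group has taken.
sample_recursive <- function(df, n) {
  used <- df$ID2[0]    # empty vector with the same type as ID2 (numeric or character)
  keep <- integer(0)   # row indices that end up in the result
  for (g in unique(df$ID1)) {
    # candidate rows: this ID1, and ID2 not yet used by an earlier ID1
    rows <- which(df$ID1 == g & !(df$ID2 %in% used))
    # assumption: a repeated ID2 inside one group is a single candidate
    rows <- rows[!duplicated(df$ID2[rows])]
    if (length(rows) == 0) next   # all ID2 already used: leave this ID1 out
    # sample positions rather than values, so a single candidate row
    # is never misread by sample() as the range 1:rows
    picked <- rows[sample.int(length(rows), min(n, length(rows)))]
    used <- c(used, df$ID2[picked])
    keep <- c(keep, picked)
  }
  df[sort(keep), , drop = FALSE]
}
```

With the sample df above, `sample_recursive(df, n = 2)` keeps rows 1, 2, 4, 5, 6 and 7 and two randomly chosen rows out of 10, 11 and 12, which matches the structure of the expected result.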