Imagine a data frame `df` in the following format:
```
   ID1 ID2
1    A   1
2    A   2
3    A   3
4    A   4
5    A   5
6    B   1
7    B   2
8    B   3
9    B   4
10   B   5
11   C   1
12   C   2
13   C   3
14   C   4
15   C   5
```
The problem is to randomly select one row (ideally adjustable to n rows) for the first unique value of ID1, remove the selected ID2 value(s) from the dataset, then randomly select from the remaining pool of ID2 values for the second unique ID1 value, and so on, recursively.
So, for example, for the first ID1 value it would do `sample(1:5, 1)`, with the result 2. For the second ID1 value, it would do `sample(c(1, 3:5), 1)`, with the result 3. For the third ID1 value, it would do `sample(c(1, 4:5), 1)`, with the result 5. It can never happen that no unique ID2 value is left to assign to a particular ID1. However, when selecting multiple ID2 values (e.g. three), there may not be enough of them left; in that case, select as many as possible. In the end, the result should have a similar format:
```
  ID1 ID2
1   A   2
2   B   3
3   C   5
```
It should be efficient enough to handle reasonably large datasets (tens of thousands of unique values in ID1 and hundreds of thousands of ID2 values per ID1 value).
I tried multiple ways to solve this, but honestly none of them are meaningful and they would likely only add confusion, so I'm not sharing them here.
Sample data:
```r
df <- data.frame(ID1 = rep(LETTERS[1:3], each = 5),
                 ID2 = rep(1:5, 3))
```
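For illustration, here is a minimal base-R sketch of the greedy loop described above (the function name `sample_without_global_replacement` and the use of `split()` are my own choices, not something required by the problem); it walks the ID1 groups in order and samples from each group's pool minus everything already taken:

```r
# Greedy sequential sampling: for each ID1 group (in order), sample up to n
# ID2 values from that group's pool, excluding values already assigned to
# earlier groups; if fewer than n remain, take as many as possible.
sample_without_global_replacement <- function(df, n = 1) {
  taken <- integer(0)  # ID2 values already assigned to earlier ID1 groups
  out <- lapply(split(df$ID2, df$ID1), function(pool) {
    avail <- setdiff(pool, taken)
    k <- min(n, length(avail))
    # index into avail via sample.int() to avoid the sample(x, ...) pitfall
    # when avail has length 1 (sample(5, 1) would sample from 1:5)
    picked <- avail[sample.int(length(avail), k)]
    taken <<- c(taken, picked)
    picked
  })
  data.frame(ID1 = rep(names(out), lengths(out)),
             ID2 = unlist(out, use.names = FALSE))
}

df <- data.frame(ID1 = rep(LETTERS[1:3], each = 5),
                 ID2 = rep(1:5, 3))
set.seed(1)
sample_without_global_replacement(df, n = 1)
```

With `n = 1` on the sample data this returns one row per ID1 with three distinct ID2 values; with `n = 2` it returns 2 + 2 + 1 = 5 rows, since only one ID2 value remains for C. This is a readability-first sketch; for the dataset sizes mentioned above, `setdiff()`/`c()` in a loop may need to be replaced with something faster (e.g. a logical mask over the ID2 universe).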