I've got a dataset representing 50,000 simulations. Each simulation has multiple scenario IDs, and associated with each scenario ID is a second identifier called the target. The first four simulations might look like the following:
+------------+-------------+-----------+
| SIMULATION | SCENARIO ID | TARGET ID |
+------------+-------------+-----------+
|          1 |          12 |        11 |
|          1 |          10 |         2 |
|          1 |           1 |        18 |
|          2 |           3 |         9 |
|          2 |           7 |        10 |
|          2 |          21 |         2 |
|          3 |          17 |        15 |
|          3 |          12 |         9 |
|          4 |           7 |        16 |
+------------+-------------+-----------+
I want to sample this 50,000-simulation set down to a 10,000-simulation set, while preserving as closely as possible the frequency of each scenario/target combination from the full set.
I've tried stratified sampling with the `stratified` function from the `splitstackshape` package, setting scenario ID and target ID as the grouping variables. However, I can only specify the sample size for each group.
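For reference, this is roughly what I'm doing (column names `simulation`, `scenario_id`, and `target_id` stand in for my actual names; `size` here is a fixed fraction applied to every group):

```r
library(splitstackshape)

# df has one row per simulation/scenario/target combination
# size = 0.2 draws 20% of the rows from every scenario/target group
sampled <- stratified(df,
                      group = c("scenario_id", "target_id"),
                      size  = 0.2)
```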
I can play with the proportion sampled from each group until the result gets close to 10,000 simulations, but that isn't ideal, as I need this to be as automated as possible.
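To illustrate what I mean by "playing with the proportion", here is a rough sketch of the manual tuning loop (again with hypothetical column names, and an arbitrary 5% step size):

```r
library(splitstackshape)

# Start from the naive proportion and bump it until the sample
# covers roughly 10,000 distinct simulations
p <- 10000 / 50000
repeat {
  s <- stratified(df,
                  group = c("scenario_id", "target_id"),
                  size  = p)
  n_sims <- length(unique(s$simulation))
  if (n_sims >= 10000) break
  p <- p * 1.05  # increase the per-group proportion and retry
}
```

This works, but it's trial and error rather than something I can rely on unattended.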