I've got a dataset representing 50,000 simulations. Each simulation has multiple scenario IDs, and associated with each scenario ID is a second identifier called the target. The first four simulations might look like the following:
+------------+-------------+-----------+
| SIMULATION | SCENARIO ID | TARGET ID |
+------------+-------------+-----------+
|          1 |          12 |        11 |
|          1 |          10 |         2 |
|          1 |           1 |        18 |
|          2 |           3 |         9 |
|          2 |           7 |        10 |
|          2 |          21 |         2 |
|          3 |          17 |        15 |
|          3 |          12 |         9 |
|          4 |           7 |        16 |
+------------+-------------+-----------+
I want to sample this 50,000-simulation set down to a 10,000-simulation set, while preserving as closely as possible the frequency of each scenario/target combination from the full set.
I've tried stratified sampling with the `stratified` function from the `splitstackshape` package, setting scenario ID and target ID as the grouping variables. However, I can only specify the sample size for each group.
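For reference, this is roughly what I'm doing (column names `simulation`, `scenario_id`, and `target_id` stand in for my actual names; `size` here is a fixed fraction applied to every group):

```r
library(splitstackshape)

# df has one row per simulation/scenario/target combination
# size = 0.2 draws 20% of the rows from every scenario/target group
sampled <- stratified(df,
                      group = c("scenario_id", "target_id"),
                      size  = 0.2)
```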
I can play with the proportion sampled from each group until the result gets close to 10,000 simulations, but that isn't ideal, as I need this to be as automated as possible.
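To illustrate what I mean by "playing with the proportion", here is a rough sketch of the manual tuning loop (again with hypothetical column names, and an arbitrary 5% step size):

```r
library(splitstackshape)

# Start from the naive proportion and bump it until the sample
# covers roughly 10,000 distinct simulations
p <- 10000 / 50000
repeat {
  s <- stratified(df,
                  group = c("scenario_id", "target_id"),
                  size  = p)
  n_sims <- length(unique(s$simulation))
  if (n_sims >= 10000) break
  p <- p * 1.05  # increase the per-group proportion and retry
}
```

This works, but it's trial and error rather than something I can rely on unattended.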