Stratified sampling where the sample size varies by group in R

Question

I'm fairly new to R, and I'm stuck on stratified sampling when the sample size differs from group to group.

The data looks like this:

And the sample size varies by group (stratum):

I used stratified sampling, but I can't figure out how to pass the sample sizes.

Result <- stratified(Population, c("Loc", "Format"), 
                 Population$SampleSize), replace = FALSE, 
                 keep.rownames = T)

I get an error message saying "size should be entered as a named vector". Could anyone help? Thank you.

What happens when you delete the parenthesis after `Population$SampleSize`? — nghauran, Oct 09 '17 at 22:11
Don't post images of your data; they aren't helpful because we can't load images. Read this site's help and [this](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) and edit your question. — shea, Oct 09 '17 at 22:11

score 1 · Accepted Answer · answered Oct 10 '17 at 16:18

I assume you're using stratified from my "splitstackshape" package.

The error explains what's required: a named vector (something like c(a = 5, b = 10), for example).
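For illustration, a named vector of this kind can be built with base R's `setNames()`. This is only a minimal sketch with made-up group labels and counts, not the asker's data:

```r
# Hypothetical per-group sample sizes; the labels and counts are invented.
groups <- c("a", "b")
counts <- c(5, 10)

# setNames() attaches the labels, giving the same thing as c(a = 5, b = 10).
sizes <- setNames(counts, groups)
sizes

# The names are what tie each size to a group.
names(sizes)
```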

However, that feature of the function assumes that only one variable is used for stratification. To work around this, you can create a new grouping variable by pasting together your "Loc" and "Format" columns.

Here's a simple example....

Start with some sample data standing in for your original dataset, plus a second dataset that lists the sample size you want for each group.

library(data.table)       # data.table() and := ; harmless if already attached
library(splitstackshape)  # provides stratified()
set.seed(1)
mydf <- data.table(strata1 = sample(letters[1:2], 25, TRUE), 
                   strata2 = sample(c("A", "B"), 25, TRUE), 
                   values = sample(25, replace = TRUE))
head(mydf)
#    strata1 strata2 values
# 1:       a       A     12
# 2:       a       A     22
# 3:       b       A     11
# 4:       b       B      7
# 5:       a       A      2
# 6:       b       A      3

wanted <- data.table(strata1 = c("a", "a", "b", "b"),
                     strata2 = c("A", "B", "A", "B"),
                     count = c(2, 3, 5, 2))
wanted
#    strata1 strata2 count
# 1:       a       A     2
# 2:       a       B     3
# 3:       b       A     5
# 4:       b       B     2

To get the output, we'll add a column called "KEY" combining the two stratifying columns. You can do that to both of the datasets, but I simply did it on the fly with the "wanted" dataset.

out <- stratified(
  mydf[, KEY := paste(strata1, strata2, sep = "_")], "KEY",
  with(wanted, setNames(count, paste(strata1, strata2, sep = "_"))))
out
#     strata1 strata2 values KEY
#  1:       a       A     21 a_A
#  2:       a       A      2 a_A
#  3:       a       B      9 a_B
#  4:       a       B      3 a_B
#  5:       a       B      9 a_B
#  6:       b       A     17 b_A
#  7:       b       A     12 b_A
#  8:       b       A      3 b_A
#  9:       b       A     17 b_A
# 10:       b       A     13 b_A
# 11:       b       B      8 b_B
# 12:       b       B     20 b_B

Compare the resulting sample sizes by the original stratification variables:

out[, .N, .(strata1, strata2)]
#    strata1 strata2 N
# 1:       a       A 2
# 2:       a       B 3
# 3:       b       A 5
# 4:       b       B 2

Thank you for the solution. Please be patient with me. I got an error message: Number of groups is 1 but number of sizes supplied is 46. I know that my "wanted" has 46 rows, but I don't understand what "number of groups is 1" means. The "KEY" column should have 46 unique values as well. — Huaying Pu, Oct 11 '17 at 15:22
@HuayingPu, you don't have to do it all in one step. Break it into multiple steps and see where the problem is. Check, for instance, that the "KEY" column does indeed have the correct number of unique values. Your dataset is a `data.table`, correct? — A5C1D2H2I1M1N2O1R2T1, Oct 11 '17 at 15:25
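
The step-by-step check suggested in the comment above could look roughly like this. It is a base-R sketch on invented data (the `mydf` and `wanted` here are small stand-ins, and `sizes`, `missing_in_data`, `missing_in_sizes`, and `available` are names made up for the illustration). It only inspects the keys and does not call `stratified` itself:

```r
# Invented stand-in data: 24 rows, 6 per strata1/strata2 combination.
mydf <- data.frame(strata1 = rep(c("a", "b"), each = 12),
                   strata2 = rep(c("A", "B"), times = 12),
                   stringsAsFactors = FALSE)
wanted <- data.frame(strata1 = c("a", "a", "b", "b"),
                     strata2 = c("A", "B", "A", "B"),
                     count = c(2, 3, 5, 2),
                     stringsAsFactors = FALSE)

# Step 1: build the combined key in the data and check how many groups it has.
mydf$KEY <- paste(mydf$strata1, mydf$strata2, sep = "_")
length(unique(mydf$KEY))

# Step 2: build the named vector of sizes using the same key format.
sizes <- setNames(wanted$count,
                  paste(wanted$strata1, wanted$strata2, sep = "_"))
length(sizes)

# Step 3: look for keys that appear on one side only.
missing_in_data  <- setdiff(names(sizes), mydf$KEY)
missing_in_sizes <- setdiff(mydf$KEY, names(sizes))
missing_in_data
missing_in_sizes

# Step 4: count the rows available per key and compare with the request.
available <- sapply(names(sizes), function(k) sum(mydf$KEY == k))
available >= sizes
```

If the number of unique keys in the data differs from the number of sizes, or either `setdiff()` result is non-empty, that mismatch is the first thing to look into before passing the key column and the sizes on to the sampling step.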
