I need to draw a sample of subjects from a list and assign them as a control group for a study. The sample has to have a composition similar to the full list across several variables. I am trying to do this in R with the `sample` function, but I don't know how to specify the different probabilities for each variable. Let's say I have a table with the following headers:

ID Name Campaign Gender

I need a sample of 10 subjects with the following composition of Campaign attributes:

D2D --> 25%

F2F --> 38%

TM --> 17%

WW --> 21%

This means that in my data set 25% of subjects come from a Door to Door campaign (D2D), 38% from a Face to Face campaign (F2F), etc.

The gender composition is as follows:

Male --> 54%

Female --> 46%

When I get a random sample of 10 subjects I need it to have a similar composition.

I have been searching for hours, and the closest I was able to get to anything similar was this answer: "taking data sample in R". However, I need to assign more than one probability.

I am sure that this could help anyone who wants to get a representative sample from a data set.

vdBurg
  • 2
    look at parameter `prob` from `sample` function – storaged Jun 08 '13 at 17:38
  • 2
    I just answered this question by another user: http://stackoverflow.com/questions/17001808/generate-random-integers-between-two-values-with-a-given-probability-using-r/17001922#17001922. – Paul Hiemstra Jun 08 '13 at 17:51
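
As a rough illustration of the `prob` argument mentioned in the comments above: it takes one weight per element being sampled, so here it only controls the mix of a single variable (Campaign). The vector name `campaigns` and the draw count are assumptions made for this sketch; they are not from the question.

```r
set.seed(1)  # only to make the illustration reproducible

campaigns <- c("D2D", "F2F", "TM", "WW")

# Weights taken from the question. They sum to 1.01 because of rounding;
# sample() accepts weights that do not sum to one.
draws <- sample(campaigns, size = 10000, replace = TRUE,
                prob = c(0.25, 0.38, 0.17, 0.21))

# Observed shares should land close to the requested weights
round(prop.table(table(draws)), 2)
```

This reproduces the Campaign mix on its own, but it does nothing to keep the Gender mix in line at the same time, which is the gap the question is about.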

1 Answer

It sounds like you are interested in taking a stratified random sample. You could do this using the `stratsample()` function from the `survey` package.

In the example below, I create some fake data to mimic what you have, then I define a function to take a proportional stratified random sample, and finally I apply the function to the fake data.

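library(survey)  # provides stratsample()

# example data
# (the Campaign weights sum to 1.01; sample() does not require weights to sum to one)
ndf <- 1000
df <- data.frame(ID=sample(ndf), Name=sample(ndf),
    Campaign=sample(c("D2D", "F2F", "TM", "WW"), ndf, prob=c(0.25, 0.38, 0.17, 0.21), replace=TRUE),
    Gender=sample(c("Male", "Female"), ndf, prob=c(0.54, 0.46), replace=TRUE))

# function to take a random proportional stratified sample of size n
rpss <- function(stratum, n) {
    # share of each stratum in the data
    props <- table(stratum)/length(stratum)
    # sample size per stratum, rounded to whole subjects
    nstrat <- as.vector(round(n*props))
    # make sure every stratum contributes at least one subject
    nstrat[nstrat==0] <- 1
    names(nstrat) <- names(props)
    # returns the row numbers of the selected subjects
    stratsample(stratum, nstrat)
}

# take a random proportional stratified sample of size 10
# drop=TRUE discards Campaign x Gender combinations that do not occur in the data
selrows <- rpss(stratum=interaction(df$Campaign, df$Gender, drop=TRUE), n=10)
df[selrows, ]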
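
To see what the two safeguards in `rpss()` do (`nstrat[nstrat==0] <- 1` and `drop=TRUE`), here is a small sketch that repeats only the allocation step on toy data. The helper name `alloc` and the toy vectors are assumptions for illustration; the helper does not call `stratsample()`, so it runs with base R alone.

```r
# Allocation step of rpss() only, without the call to stratsample()
alloc <- function(stratum, n) {
    props <- table(stratum) / length(stratum)
    nstrat <- as.vector(round(n * props))
    nstrat[nstrat == 0] <- 1
    names(nstrat) <- names(props)
    nstrat
}

# Toy data: 20 rows, only 4 of the 6 possible Campaign x Gender combinations occur
campaign <- factor(rep(c("D2D", "F2F", "F2F", "TM"), each = 5))
gender   <- factor(rep(c("Male", "Female", "Male", "Female"), each = 5))

all_levels  <- interaction(campaign, gender)               # keeps empty combinations
used_levels <- interaction(campaign, gender, drop = TRUE)  # only observed combinations

nlevels(all_levels)    # 6 levels, two of them with no rows
nlevels(used_levels)   # 4 levels

alloc(all_levels, n = 8)   # empty combinations are still asked for 1 row each
alloc(used_levels, n = 8)  # 2 rows from each observed combination
```

Without `drop=TRUE`, the empty combinations end up with a requested size of one even though no rows exist for them. Note also that, because of the rounding and the minimum of one per stratum, the allocated sizes can add up to something other than `n`.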
Jean V. Adams
  • Many thanks, Jean. This is what I was looking for; it makes things especially easy if I have more variables I want to take into account. I have been trying to run the function on my own data, but it returns an error. I looked for the difference between your fake data and mine, but there isn't much difference at first sight. So I reduced the number of rows in your fake data to 90, and then it gives me the exact same error that I get with my own data: – vdBurg Jun 13 '13 at 17:42
  • > ndf <- 90
    > df <- data.frame(ID=sample(ndf), Name=sample(ndf),
    +     Campaign=sample(c("D2D", "F2F", "TM", "WW"), ndf, prob=c(0.25, 0.38, 0.17, 0.21), replace=TRUE),
    +     Gender=sample(c("Male", "Female"), ndf, prob=c(0.54, 0.46), replace=TRUE))
    > selrows <- rpss(stratum=interaction(df$Campaign, df$Gender), n=10)
    Error in rval[j + (1:counts[i])] <- sample(allrows[strata == thisstrat], : replacement has length zero
    – vdBurg Jun 13 '13 at 17:42
  • That appears to be a result of one stratum with few observations in the data (`df`) such that `n` times the proportion is rounded to zero. A quick and dirty fix is to make sure that each stratum has a sample size of at least one ... `rpss <- function(stratum, n) { props <- table(stratum)/length(stratum) nstrat <- as.vector(round(n*props)) nstrat[nstrat==0] <- 1 names(nstrat) <- names(props) stratsample(stratum, nstrat) }` – Jean V. Adams Jun 13 '13 at 18:11
  • Hi Jean, sorry for the late response. I have been busy, and I wanted to look at the functions myself as well. What I don't understand is this: if you are using the stratsample function, which samples without replacement as I read [here](http://www.inside-r.org/packages/cran/survey/docs/stratsample), why is the sample giving me duplicates? I am also struggling to get a sample with more variables... – vdBurg Jun 15 '13 at 10:45
  • For example, when I run it, it gives me an error: – vdBurg Jun 15 '13 at 10:46
  • Ah, I see. The problem here is that the `interaction()` function is keeping all possible interactions as levels, even if they don't exist in the data. You can fix this by adding the argument `drop=TRUE`. For example: `selrows <- rpss(stratum=interaction(df$Campaign, df$Fee, df$Gender, drop=TRUE), n=10)` – Jean V. Adams Jun 17 '13 at 14:41
  • So cool Jean, so many thanks! I really think it will help many others too! – vdBurg Jun 17 '13 at 18:19
  • Glad it helped. I included the edits that came up in these comments in the answer for completeness. – Jean V. Adams Jun 17 '13 at 18:28
  • Jean, is [this](https://stackoverflow.com/q/64159397/7223434) also possible using your code? – rnorouzian Oct 01 '20 at 16:28
  • Not as is, @rnorouzian. Right now the function is set up to take a random sample proportional to the data. The post you point to seems to have a fixed set of proportions in mind. – Jean V. Adams Oct 01 '20 at 21:17
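
For the fixed-proportions case raised in the last comments, one possible variant is sketched below. This is not the answer's method: `fixed_prop_sample` and its `target` argument are hypothetical names introduced here, and the sketch uses base R only. It assumes `target` is a named vector of desired shares whose names match values of the stratum variable.

```r
# Hypothetical helper: pick row indices so that each stratum named in `target`
# gets round(n * share) rows, sampled without replacement within the stratum.
fixed_prop_sample <- function(stratum, n, target) {
    stratum <- as.character(stratum)
    stopifnot(all(names(target) %in% stratum))
    # rescale the shares so they sum to one, then convert to whole counts
    counts <- round(n * target / sum(target))
    unlist(lapply(names(counts), function(s) {
        rows <- which(stratum == s)
        # fail early if a stratum has too few rows for its requested count
        stopifnot(length(rows) >= counts[[s]])
        rows[sample.int(length(rows), counts[[s]])]
    }))
}

set.seed(42)  # only to make the illustration reproducible
gender <- sample(c("Male", "Female"), 200, replace = TRUE)

# Ask for 70% Male and 30% Female regardless of the mix in the data
sel <- fixed_prop_sample(gender, n = 10, target = c(Male = 0.7, Female = 0.3))
table(gender[sel])
```

As with `rpss()`, the rounding means the total can differ from `n` when the requested shares do not divide evenly.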