R - subsetting original data frame: N random observations, 50% of N has ethnicity E and 50% of N has education E

Question

Hi Stackoverflow users,

I am new to R and have been learning it for only a couple of weeks. I have a data frame with 15 string variables describing people's characteristics (e.g. ethnicity, education, country of origin); one row is one person.

How can I tell R to create a subset of the original data frame that contains N random people (drawn with replacement), such that 50% of N have Ethnicity ET and 50% of N have Education ED? I already know the basics, A) and B):

A) I know how to draw N observations at random with replacement, as suggested here and here. For example:

df[sample(nrow(df), size=N, replace=TRUE), ]

B) In this other post, there are examples of how to condition the random draw (without replacement):

df[ sample( which( df$Ethnicity == "ET" | df$Education == "ED" ) , N ) , ]

However, I would like to know how to make more complex conditional draws, that is, 50% of N has to have Ethnicity ET, and 50% of N has to have Education ED. Thus, in this new sample of size N, the two conditions only partially intersect: for some people Ethnicity==ET & Education==ED, for some people Ethnicity!=ET & Education==ED, for some people Ethnicity==ET & Education!=ED, for some people Ethnicity!=ET & Education!=ED.

@Andreas I have now expanded the post to show what I know and what I have found on Stack Overflow, and to give some examples. However, what I am asking in this post is very general and quite complex (given my limited knowledge), so I cannot see how this additional information could help. — Fuca26, Oct 07 '19 at 03:29

GKi · Accepted Answer · 2020-04-21T07:19:05.263

A simple solution would be to sample 1/4 of N from each of the four combinations, assuming that every combination exists in the data:

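n  <- 1e2 / 4 # N = 100 rows in total, 25 per combination
# sample(..., n, TRUE): draw n row indices with replacement from each combination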
y <- x[c(sample(which(x$et & x$ed), n, TRUE)
         , sample(which(!x$et & x$ed), n, TRUE)
         , sample(which(x$et & !x$ed), n, TRUE)
         , sample(which(!x$et & !x$ed), n, TRUE)),]
table(y)
#       ed
#et      FALSE TRUE
#  FALSE    25   25
#  TRUE     25   25
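
As a quick check of the approach above, here is a minimal sketch that rebuilds the example data from the Data block, fails loudly if a combination is empty, and confirms that both margins come out at 50%. The helper `draw` is a hypothetical name introduced here, not part of the answer; it also sidesteps the documented behaviour of `sample()` where a single number k is treated as 1:k.

```r
# Example data, built the same way as in the answer's "Data" block
set.seed(7)
x <- data.frame(et = sample(c(TRUE, FALSE), 1e4, TRUE, c(.25, .75)),
                ed = sample(c(TRUE, FALSE), 1e4, TRUE, c(.75, .25)))

# Hypothetical helper (not from the answer): draw `size` elements of `idx`
# with replacement. Stops on an empty combination, and avoids sample()'s
# special case where a single number k is interpreted as 1:k.
draw <- function(idx, size) {
  if (length(idx) == 0L) stop("no rows available for this combination")
  idx[sample.int(length(idx), size, replace = TRUE)]
}

k <- 25  # rows per combination, so N = 100
cells <- list(x$et & x$ed, !x$et & x$ed, x$et & !x$ed, !x$et & !x$ed)
y <- x[unlist(lapply(cells, function(cell) draw(which(cell), k))), ]

# Both margins are exactly one half
mean(y$et)  # 0.5
mean(y$ed)  # 0.5
```

Because each of the four combinations contributes the same number of rows, the 50% shares hold by construction, whatever the seed.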

If a combination does not exist, you can derive the number of rows to draw from each combination with table, like this:

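n  <- 1e2
x  <- x[!x$et | x$ed,] # simulate a missing combination: drop all rows with et & !ed
tt  <- table(x) # counts per combination
tt  <- tt * t(tt) # symmetric weights; zero where a combination or its mirror is missing
tt <- tt / rowSums(tt)
tt <- tt / rep(colSums(tt), each=2)
tt <- round(proportions(tt)*n) # rounding means the target number n might not be reached exactly
#tt <- round(prop.table(tt)*n) # use prop.table() before R 4.0.0, where proportions() was added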
y <- x[c(sample(which(!x$et & !x$ed), tt[1], TRUE)
         , sample(which(x$et & !x$ed), tt[2], TRUE)
         , sample(which(!x$et & x$ed), tt[3], TRUE)
         , sample(which(x$et & x$ed), tt[4], TRUE)),]
table(y)
#       ed
#et      FALSE TRUE
#  FALSE    50    0
#  TRUE      0   50
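
To see why a missing combination forces this result, note that requiring half of the rows to have et and half to have ed leaves only one free choice: the number of rows that have both. The two "mixed" cells must then be equal, and so must the two diagonal cells. A minimal sketch of that arithmetic, where `cell_counts` and `both` are hypothetical names introduced for illustration:

```r
# Hypothetical helper: cell counts for a sample of size n in which exactly
# half of the rows have et and half have ed. `both` is the number of rows
# that have et and ed at the same time (between 0 and n / 2).
cell_counts <- function(n, both) {
  stopifnot(n %% 2 == 0, both >= 0, both <= n / 2)
  off <- n / 2 - both  # rows with exactly one of the two properties, per cell
  matrix(c(both, off, off, both), nrow = 2,
         dimnames = list(et = c("FALSE", "TRUE"), ed = c("FALSE", "TRUE")))
}

cell_counts(100, 25)  # every cell is 25, as in the first example
cell_counts(100, 50)  # only the diagonal cells are used, 50 each
```

If one mixed combination is absent from the data, `off` has to be 0, so `both` must equal n / 2, which matches the 50/0/0/50 table above.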

Data:

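set.seed(7) # make the example reproducible
n  <- 1e4
# Example data: two logical columns; et is TRUE with probability 0.25, ed with probability 0.75
x  <- data.frame(et=sample(c(TRUE,FALSE), n, TRUE, c(.25,.75)), ed=sample(c(TRUE,FALSE), n, TRUE, c(.75,.25)))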

Is it OK if I turn the subsample into a data frame by simply writing ```y= data.frame(y)``` after ```table(y)```? — Fuca26, Oct 07 '19 at 20:21

What are you doing here: ```x <- data.frame(et=sample(c(TRUE,FALSE), n, TRUE, c(.25,.75)), ed=sample(c(TRUE,FALSE), n, TRUE, c(.75,.25)))```? — Fuca26, Oct 07 '19 at 20:22
