Selecting a sample to match the distribution of variables in another dataset

Question

Let x be a dataset with 5 variables and 15 observations:

age gender  height  weight  fitness
17  M   5.34    68  medium
23  F   5.58    55  medium
25  M   5.96    64  high
25  M   5.25    60  medium
18  M   5.57    60  low
17  F   5.74    61  low
17  M   5.96    71  medium
22  F   5.56    75  high
16  F   5.02    56  medium
21  F   5.18    63  low
20  M   5.24    57  medium
15  F   5.47    72  medium
16  M   5.47    61  high
22  F   5.88    73  low
18  F   5.73    62  medium

The frequencies of the values for the fitness variable are as follows: low = 4, medium = 8, high = 3.

Suppose I have another dataset y with the same 5 variables but 100 observations. The frequencies of the values for the fitness variable in this dataset are as follows: low = 42, medium = 45, high = 13.

Using R, how can I obtain a representative sample from y such that the sample fitness closely matches the distribution of the fitness in x?

My initial idea was to use the sample function in R with weighted probabilities in the prob argument. However, using probabilities would force an exact match for the frequency distribution. My objective is to get a close enough match while maximizing the sample size.

Additionally, how would I add another constraint so that the distribution of gender also closely matches that of x?

I think you can sample at most 22, 45, and 17 from y, for a total of 84 (out of the 100). This gives proportions of 0.26, 0.54, and 0.20, which closely match that of x (0.27, 0.53, 0.20). — Edward, Mar 06 '20 at 02:28
But how exactly do I execute this and let R do the sampling for me? Note also that I can't possibly sample 17 high fitness values since the max is 13. The sample distribution doesn't necessarily have to be that close anyway, just enough to pass for a representative sample of x. I suppose the size of the sample is not as pressing of an issue for me as practically obtaining a sample in the first place. I realize too that the more constraints I put, the smaller the sample size will be anyway. — Outlier, Mar 06 '20 at 02:43
Ahh, yes. I forgot to add that constraint. So adjust the 84 by subtracting a certain amount from each and then recalculate the sample sizes to get 18, 35 and 13, which give proportions of 0.27, 0.53 and 0.20. — Edward, Mar 06 '20 at 02:53

Edward · Answer 1 · 2020-03-06T04:11:04.017

The smallest frequency in your y is 13, for the "high" fitness level, so you can't sample more than 13 records of that level. That's your first constraint. You want to maximize your sample size, so you take all 13. To match the proportions in x, 13 should be 20% of your total, which means your total must be 65 (13/0.2). The other frequencies must therefore be 17 (low) and 35 (medium), that is, 65 × 4/15 ≈ 17.3 and 65 × 8/15 ≈ 34.7 rounded to whole records. Since y has enough records of these fitness levels (42 and 45), you can take this as your sample. If either of these counts exceeded the number available in y, you'd have another constraint and would have to scale the counts down accordingly.
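The arithmetic above can be sketched in base R as follows. This is only an illustration under the frequencies given in the question; the names x_counts, avail, n_max and alloc are made up for the example.

```r
# Frequencies of fitness in x (target) and in y (available), from the question
x_counts <- c(low = 4, medium = 8, high = 3)
avail    <- c(low = 42, medium = 45, high = 13)

# Each level caps the total at avail / proportion; the tightest cap wins.
# Written as avail * sum(x_counts) / x_counts to keep the arithmetic exact.
n_max <- floor(min(avail * sum(x_counts) / x_counts))

# Proportional allocation, rounded to whole records and never above what y has
alloc <- pmin(round(n_max * x_counts / sum(x_counts)), avail)

n_max
alloc
round(alloc / sum(alloc), 3)
```

With these inputs the binding level is "high", so the total is 65 and the allocation is 17, 35 and 13. Rounding can make the counts sum to slightly more or less than n_max for other inputs, so it's worth checking sum(alloc) before sampling.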

For sampling, you'd first select all records with "high" fitness (sampling with certainty). Next, sample from the other levels separately (stratified random sampling). Finally, combine all three.

Example:

rm(list=ls())
# Set up the data (your "y"):
df <- data.frame(age=round(rnorm(100, 20, 5)), 
                 gender=factor(gl(2,50), labels=LETTERS[c(6, 13)]), 
                 height=round(rnorm(100, 12, 3)), 
                 fitness=factor(c(rep("low", 42), rep("medium", 45), rep("high", 13)), 
                                levels=c("low","medium","high")))

Create subsets for sampling:

fit.low <- subset(df, subset=fitness=="low")
fit.medium <- subset(df, subset=fitness=="medium")
fit.high <- subset(df, subset=fitness=="high")

Sample 17 from the low fitness group (40.5% of that group, or 26.2% of the final sample of 65).

fit.low_sam <- fit.low[sample(nrow(fit.low), 17), ]

Sample 35 from the medium fitness group (77.8% of that group, or 53.8% of the final sample).

fit.med_sam <- fit.medium[sample(nrow(fit.medium), 35), ]

Combine them all.

fit.sam <- rbind(fit.low_sam, fit.med_sam, fit.high)
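The same three steps can also be written without a separate subset per level. The following is a minimal base-R sketch, assuming the 17/35/13 allocation worked out earlier; the id column and the names alloc, groups, sampled and fit_sam are hypothetical and only there for illustration.

```r
set.seed(1)  # only so the illustration is reproducible

# Stand-in for y: 100 rows with the fitness frequencies from the question
y <- data.frame(
  id = 1:100,
  fitness = factor(c(rep("low", 42), rep("medium", 45), rep("high", 13)),
                   levels = c("low", "medium", "high"))
)

# Number of rows to draw from each fitness level
alloc <- c(low = 17, medium = 35, high = 13)

# Split y by level, draw the requested number of rows from each piece
# without replacement, then stack the pieces back together
groups  <- split(y, y$fitness)
sampled <- Map(function(g, k) g[sample.int(nrow(g), k), , drop = FALSE],
               groups, alloc[names(groups)])
fit_sam <- do.call(rbind, sampled)

# Compare the achieved proportions with those in x (0.267, 0.533, 0.200)
round(prop.table(table(fit_sam$fitness)), 3)
```

Indexing alloc by names(groups) keeps each count paired with its own level even if the factor levels are ordered differently.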

I tried to do this using the sample_n and sample_frac functions from dplyr, but as far as I can tell these functions don't accept a different size or proportion for each group, so they can't do stratified sampling with unequal allocations directly.

library(dplyr)
df %>%
  group_by(fitness) %>%
  sample_n(size=c(17,35,13), weight=c(0.27, 0.53, 0.2))
# Error
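One possible workaround that stays within dplyr, offered as a sketch rather than a recommended idiom: shuffle the rows inside each group, then keep only as many as that group's allocation allows. The column draw_order, the vector alloc and the id column are hypothetical names introduced for this example.

```r
library(dplyr)

set.seed(1)  # only so the illustration is reproducible

# Stand-in data with the fitness frequencies of y
df <- data.frame(
  id = 1:100,
  fitness = factor(c(rep("low", 42), rep("medium", 45), rep("high", 13)),
                   levels = c("low", "medium", "high"))
)

# Per-level sample sizes, looked up by level name
alloc <- c(low = 17, medium = 35, high = 13)

fit_sam <- df %>%
  group_by(fitness) %>%
  # Give every row a random position within its own group
  mutate(draw_order = sample(n())) %>%
  # Keep the first k rows of each group, where k depends on the level
  filter(draw_order <= unname(alloc[as.character(fitness)])) %>%
  ungroup() %>%
  select(-draw_order)

count(fit_sam, fitness)
```

Because the positions are a random permutation within each group, keeping the first k of them is a simple random sample without replacement from that group.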

The sampling package can do this directly (see also the related question "Stratified random sampling from data frame").

library(sampling)
s <- strata(df, "fitness", size=c(17,35,13), "srswor")
getdata(df, s)

Yes, I understand much more clearly now! The problem really is one of stratified sampling so thanks for recommending the sampling package. — Outlier, Mar 06 '20 at 04:10

Istrel · Accepted Answer · 2020-03-06T09:06:29.120

Consider using rmultinom to prepare the sample counts for each level of fitness.

Prepare the data (I have used the y preparation from @Edward's answer):

x <- read.table(text = "age gender  height  weight  fitness
17  M   5.34    68  medium
23  F   5.58    55  medium
25  M   5.96    64  high
25  M   5.25    60  medium
18  M   5.57    60  low
17  F   5.74    61  low
17  M   5.96    71  medium
22  F   5.56    75  high
16  F   5.02    56  medium
21  F   5.18    63  low
20  M   5.24    57  medium
15  F   5.47    72  medium
16  M   5.47    61  high
22  F   5.88    73  low
18  F   5.73    62  medium", header = TRUE)

y <- data.frame(age=round(rnorm(100, 20, 5)), 
                 gender=factor(gl(2,50), labels=LETTERS[c(6, 13)]), 
                 height=round(rnorm(100, 12, 3)), 
                 fitness=factor(c(rep("low", 42), rep("medium", 45), rep("high", 13)), 
                                levels=c("low","medium","high")))

Now the sampling procedure. Update: I have changed the code to handle the two-variable case (gender and fitness).

library(tidyverse)

N_SAMPLES = 100

# Calculate frequencies
freq <- x %>%
    group_by(fitness, gender) %>% # You can set any combination of factors
    summarise(freq = n() / nrow(x)) 

# Prepare multinomial distribution
distr <- rmultinom(N_SAMPLES, 1, freq$freq)
# Convert to counts
freq$counts <- rowSums(distr)

# Join y with frequency for further use in sampling
y_count <- y %>% left_join(freq, by = c("fitness", "gender"))

# Perform sampling using multinomial distribution counts
y_sampled <- y_count %>%
    group_by(fitness, gender) %>% # Should be the same as in frequencies calculation
    # If the count is greater than the number of observations in the group, take the whole group
    sample_n(size = ifelse(n() > first(counts), first(counts), n()),
        replace = FALSE) %>%
    select(-freq, -counts)
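To see what the multinomial step does on its own, here is a minimal base-R sketch for the single-variable case. The target size of 65 and the names p, avail, counts and capped are assumptions made for the illustration, not part of the code above.

```r
set.seed(42)  # only so the illustration is reproducible

# Target proportions from x and available rows per level in y
p     <- c(low = 4, medium = 8, high = 3) / 15
avail <- c(low = 42, medium = 45, high = 13)

# One multinomial draw of 65 items; this has the same distribution as
# summing 65 single-item draws, which is what rowSums(distr) does above
counts <- as.vector(rmultinom(1, size = 65, prob = p))
names(counts) <- names(p)

# A level cannot contribute more rows than y actually has
capped <- pmin(counts, avail)

counts
capped
```

The counts are random, so they match the proportions in x only approximately and change from run to run. After capping, the total can fall below the requested size, which is why the sampled data frame may end up smaller than N_SAMPLES.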

When I run summarise(freq = n() / nrow(x)) , I'm getting an error: n() should only be called in a data context. — Outlier, Mar 06 '20 at 21:01
Never mind, I found out it was just due to conflict in packages, since I loaded dplyr previously. — Outlier, Mar 06 '20 at 21:09
