Creating similar samples based on three different categorical variables

Question

I am trying to create two similar samples based on three categorical attributes: sales_group, age_group, and country. I want to create the samples first and then run an analysis to see which of the two is better. The proportions of countries, age groups, and sales groups should therefore be similar in both samples.

For example, Sample A and Sample B both contain the following variables: Id, Country, Age, Sales.

The proportion of Country in Sample A is:

USA - 58%, UK - 22%, India - 8%, France - 6%, Germany - 6%

The proportion of Country in Sample B is:

India - 42%, UK - 36%, USA - 12%, France - 3%, Germany - 5%

The same goes for the other categorical variables, age_group and sales_group.

Thanks in advance for help

Sample data and desired results would really help, as would some definition or guidance on what you mean by "sample", how large they are, how much data you have, and so on. — Gordon Linoff, Jul 12 '18 at 21:03
When asking for help, you should include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. — MrFlick, Jul 12 '18 at 21:05

score 0 · Answer 1 · answered Aug 11 '18 at 17:56

You do not need a special sampling procedure, because the proportion in a simple random sample is an unbiased estimate of the population proportion. If you have, say, more than 1,000 observations and draw more than about 30 rows per sample, the estimates will usually be reasonably close (Central Limit Theorem). You can see this in the simulation below:

set.seed(123)
n <- 10000 # Number of rows in the source data frame
df <- data.frame(sales_group = sample(LETTERS[1:4], n, replace = TRUE), 
                 age_group = sample(c("old", "young"), n, replace = TRUE), 
                 country = sample(c("USA", "UK", "India", "France", "Germany"), n, replace = TRUE),
                 amount = abs(100 * rnorm(n))) 

s <- 100 # Number of rows drawn for each sample
sampleA <- df[sample(nrow(df), s), ]
sampleB <- df[sample(nrow(df), s), ]

table(sampleA$sales_group)
# A  B  C  D 
# 23 22 32 23 

table(sampleB$sales_group)
# A  B  C  D 
# 25 22 28 25
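
To check all three variables at once rather than only sales_group, something like the following could be used. This is only a sketch: `count_matrix` is a hypothetical helper name, not part of base R, and the chi-squared test is just one possible way to judge whether the two samples differ. It also treats the two samples as independent, which is only approximately true here because both are drawn from the same data frame and a few rows may overlap.

```r
set.seed(123)
n <- 10000
df <- data.frame(sales_group = sample(LETTERS[1:4], n, replace = TRUE),
                 age_group = sample(c("old", "young"), n, replace = TRUE),
                 country = sample(c("USA", "UK", "India", "France", "Germany"), n, replace = TRUE),
                 amount = abs(100 * rnorm(n)))

s <- 100
sampleA <- df[sample(nrow(df), s), ]
sampleB <- df[sample(nrow(df), s), ]

# Hypothetical helper: counts per level of `var` in both samples,
# returned as a 2 x k matrix (one row per sample, one column per level)
count_matrix <- function(a, b, var) {
  lv <- sort(unique(c(as.character(a[[var]]), as.character(b[[var]]))))
  m <- rbind(sampleA = as.vector(table(factor(a[[var]], levels = lv))),
             sampleB = as.vector(table(factor(b[[var]], levels = lv))))
  colnames(m) <- lv
  m
}

for (v in c("sales_group", "age_group", "country")) {
  counts <- count_matrix(sampleA, sampleB, v)
  cat("\n", v, "\n")
  print(round(prop.table(counts, margin = 1), 2)) # row-wise proportions
  # A small p-value would suggest the two samples differ on this variable
  cat("chi-squared p-value:", format.pval(chisq.test(counts)$p.value, digits = 3), "\n")
}
```

If one of the p-values turns out small, redrawing the samples or increasing s would be the simplest remedy under this approach.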

DISCLAIMER: However, if some categories have very small or very large proportions and your samples are too small, you will need a more advanced procedure such as Laplace smoothing.
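
For illustration, additive (Laplace) smoothing of category proportions might look roughly like this. It is a minimal sketch: `laplace_props` is a hypothetical helper, `alpha = 1` is the conventional add-one choice rather than a recommendation, and the tiny `small` vector is made-up data.

```r
# Hypothetical helper: additive (Laplace) smoothing of category proportions.
# Every level gets `alpha` pseudo-counts, so unseen levels are never exactly 0.
laplace_props <- function(x, levels, alpha = 1) {
  counts <- table(factor(x, levels = levels))
  (counts + alpha) / (length(x) + alpha * length(levels))
}

countries <- c("USA", "UK", "India", "France", "Germany")
small <- c("USA", "USA", "UK", "USA", "India", "UK", "USA", "USA") # made-up tiny sample

prop.table(table(factor(small, levels = countries))) # raw: France and Germany are 0
laplace_props(small, countries)                      # smoothed: all levels are positive
```

The smoothed proportions still sum to 1, and the effect of the pseudo-counts shrinks as the sample grows.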
