R Splitting data into representative subsets based on values

Question

I have a data set from a survey that has respondents classified by certain demographic data values. The layout of the data is basically this:

      Gender   Age    Income    Region
1     Male     1      2         West
2     Male     4      2         South
3     Male     4      3         West
4     Female   4      1         Northeast
5     Female   5      2         West
6     Female   3      2         West
7     Male     1      1         South
8     Male     3      3         Northeast
9     Female   2      3         West
10    Female   4      3         Midwest

I used the code below to generate the example above. I am open to comments on how to do this better, of course.

regions <- c('Midwest','Northeast','South','West')
incomes <- c('1','2','3','4')
gender <- c('Male','Female')
age <- c('1','2','3','4','5')
data <- data.frame(stringsAsFactors = FALSE)

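# Build 100 random respondents, one row per iteration
for(i in 1:100){
  # stringsAsFactors = FALSE keeps the sampled values as character columns
  z <- data.frame(sample(gender,1),sample(age,1),sample(incomes,1),sample(regions,1),
                  stringsAsFactors = FALSE)
  data <- rbind(data, z)
}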
colnames(data) <- c("Gender","Age","Income","Region")
data

I need to break that data set into subsets that each represent the original set. That means each subset should have the same percentages of genders, age groups, income groups, and regions as the full data set. I understand exact representation might be difficult with that many factors and a small number of rows.

There is a second part to my problem. R has many built-in functions whose names make it difficult to describe a problem like this unambiguously. Split, data, factors, values, and subset are words that we might use interchangeably when talking about data in general, but they do not yield the right answer when typed into Google or Stack Overflow. I would like to know if there is more technically precise terminology I should use to describe my problem.

Not sure if this helps `df1 <- do.call(expand.grid, list(regions, incomes, gender, age)); df2 <- df1[sample(1:nrow(df1), 100, replace=FALSE),]` — akrun, May 22 '15 at 17:36
It sounds from your description like you are trying to sample from the data (by creating subsets that have the same characteristics as the larger dataset). That's an unusual thing to do and it would help if you explained why you need these subsets (as opposed to say, doing by-group analyses of gender, or region, and so on.) — TARehman, May 22 '15 at 18:02
We have a large group of people to pull from, but did not know they were going to be subset afterwards. If we had, we could have assigned these groups during collection. Each person is being recruited into testing a product. I would need some subsets that would then each be analyzed afterwards. For example, each group would be testing a new soda and we want a general idea of how people across all demographics would respond. The actual data set we are using has people that are representative of the US as a whole as far as race, income, region etc. — Luke Brumfield, May 22 '15 at 18:08
So it is, essentially, a sampling problem (wanting to create "representative samples" from the larger dataset). Would the material in this question (http://stackoverflow.com/questions/16493920/how-can-i-ensure-that-a-partition-has-representative-observations-from-each-leve) be of use for you? — TARehman, May 22 '15 at 18:52
It looks like @akrun's comment is quite close to a good answer, imo. I would add that in `expand.grid` you might want to only use unique values of those items. It's also important to note that your example data are all factors/discrete data. If income were a continuous variable, this particular approach might not work as well. — rbatt, May 22 '15 at 19:19
@TARehman That solution had a link that I found useful (the createDataPartition) but the "stratified" and "createDataPartition" functions are limited to choosing one variable. I did some testing using only one "grouping variable" as it was called in stratified, and the subset it pulled out seemed representative. My initial question would have wanted a vector to be put in place of a single value so that there would be multiple grouping variables. — Luke Brumfield, May 22 '15 at 20:12
A quick fix could be `paste` (with `sep` set) used on a few columns to get one column as input to `createDataPartition`. — Pafnucy, May 22 '15 at 20:32
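One way to read the `paste` suggestion, as a minimal sketch in base R: build a single key column from the grouping columns, which could then serve as the one grouping variable that functions such as `createDataPartition` expect. The data frame `dat` and the column name `stratum` are made up for illustration. Note that `sep` joins the columns row by row, whereas `collapse` would fold all rows into one string.

```r
# Hypothetical data frame using the question's column names
dat <- data.frame(
  Gender = c("Male", "Male", "Female", "Female", "Male"),
  Age    = c("1", "4", "4", "5", "1"),
  Income = c("2", "2", "1", "2", "2"),
  Region = c("West", "South", "Northeast", "West", "West"),
  stringsAsFactors = FALSE
)

# One key per row: `sep` joins the four columns element-wise
dat$stratum <- paste(dat$Gender, dat$Age, dat$Income, dat$Region, sep = "_")

# Number of rows that fall into each combined stratum
table(dat$stratum)
```

With four grouping variables the number of possible strata grows quickly (2 x 5 x 4 x 4 = 160 for the example generator in the question), so many strata may hold only one row or none.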
You could do this quite easily using `dplyr` by doing: `sample_dat <- dat %>% group_by(Gender, Age, Income, Region) %>% sample_frac(0.5)` which would sample 50% of your observations by the variables in `group_by`. And the 2nd part of your question, you should simply search "stratified sampling". The type of routine you're looking at is sampling without replacement though you could also sample with replacement. — chappers, May 23 '15 at 01:17
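The comments above point toward stratified sampling. As a rough base R sketch of splitting the whole data set into several subsets rather than drawing a single sample: sort the rows by the grouping variables, then deal them out cyclically. This is not taken from any of the comments; the number of subsets `k`, the seed, the data frame `dat`, and the column name `subset` are all assumptions for illustration.

```r
set.seed(42)  # assumption: fixed seed so the sketch is reproducible

# Hypothetical data with the same columns as in the question
n <- 100
dat <- data.frame(
  Gender = sample(c("Male", "Female"), n, replace = TRUE),
  Age    = sample(as.character(1:5), n, replace = TRUE),
  Income = sample(as.character(1:4), n, replace = TRUE),
  Region = sample(c("Midwest", "Northeast", "South", "West"), n, replace = TRUE),
  stringsAsFactors = FALSE
)

k <- 4  # assumed number of subsets

# Sort by the grouping variables; the random last key breaks ties
# within a stratum so the row order inside each stratum is random.
ord <- order(dat$Gender, dat$Age, dat$Income, dat$Region, runif(nrow(dat)))
dat <- dat[ord, ]

# Deal the sorted rows out cyclically: 1, 2, ..., k, 1, 2, ...
dat$subset <- rep_len(seq_len(k), nrow(dat))

# List of k data frames, one per subset
subsets <- split(dat, dat$subset)

# Compare one margin across subsets (row-wise proportions)
round(prop.table(table(dat$subset, dat$Gender), margin = 1), 2)
```

Because rows of the same stratum sit next to each other after sorting, each stratum's rows are spread across the subsets as evenly as possible. Variables later in the sort order (here `Region`) are balanced less tightly than earlier ones, so it may be worth putting the most important variable first.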
