I have a data set from a survey that has respondents classified by certain demographic data values. The layout of the data is basically this:
   Gender Age Income    Region
1    Male   1      2      West
2    Male   4      2     South
3    Male   4      3      West
4  Female   4      1 Northeast
5  Female   5      2      West
6  Female   3      2      West
7    Male   1      1     South
8    Male   3      3 Northeast
9  Female   2      3      West
10 Female   4      3   Midwest
I used the code below to generate the example above. I am open to comments on how to do this better, of course.
regions <- c('Midwest','Northeast','South','West')
incomes <- c('1','2','3','4')
gender <- c('Male','Female')
age <- c('1','2','3','4','5')
data <- data.frame(stringsAsFactors = FALSE)
for (i in 1:100) {
  # Draw one random respondent and append it as a new row
  z <- data.frame(sample(gender, 1), sample(age, 1), sample(incomes, 1), sample(regions, 1))
  data <- rbind(data, z)
}
colnames(data) <- c("Gender","Age","Income","Region")
data
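One simplification that might work here is to draw each column in a single vectorized call instead of growing the data frame row by row. This is only a sketch: the `set.seed` call and the variable `n` are my own additions for reproducibility and are not part of the loop version.

```r
# Sketch: build the same kind of example data without a loop.
set.seed(42)  # assumption: fixed seed, only so the example is repeatable
n <- 100      # hypothetical name for the number of respondents

regions <- c('Midwest', 'Northeast', 'South', 'West')
incomes <- c('1', '2', '3', '4')
gender  <- c('Male', 'Female')
age     <- c('1', '2', '3', '4', '5')

# sample(..., replace = TRUE) draws all n values for a column at once
data <- data.frame(
  Gender = sample(gender,  n, replace = TRUE),
  Age    = sample(age,     n, replace = TRUE),
  Income = sample(incomes, n, replace = TRUE),
  Region = sample(regions, n, replace = TRUE),
  stringsAsFactors = FALSE
)
head(data, 10)
```

Naming the columns inside `data.frame()` also makes the separate `colnames()` step unnecessary.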
I need to break that data set into subsets that are each representative of the original set. That is, each subset should have roughly the same proportions of genders, age groups, income groups, and regions as the full data set. I understand that exact representation might be difficult with that many factors and a small number of rows.
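To make the goal concrete, here is a minimal sketch of the kind of split I have in mind: group the rows by every observed combination of the demographic columns, shuffle within each group, and deal the rows out to the subsets in turn. The function name `balanced_split`, its arguments, and the small `demo` data frame are hypothetical and only for illustration. I am not claiming this is the standard way to do it.

```r
# Sketch: split df into k subsets, dealing rows out round-robin within
# each combination of the columns named in cols.
# balanced_split, cols and k are hypothetical names, not base R functions.
balanced_split <- function(df, cols, k) {
  # One group per observed combination of the chosen columns
  groups <- interaction(df[cols], drop = TRUE)
  subset_id <- integer(nrow(df))
  offset <- 0
  for (rows in split(seq_len(nrow(df)), groups)) {
    # Shuffle the row indices of this group
    rows <- rows[sample.int(length(rows))]
    # Deal them out to subsets 1..k in turn
    subset_id[rows] <- (offset + seq_along(rows) - 1) %% k + 1
    # Carry the position over so small groups do not all land in subset 1
    offset <- offset + length(rows)
  }
  split(df, subset_id)
}

set.seed(1)  # assumption: fixed seed, only so the example is repeatable
demo <- data.frame(
  Gender = sample(c('Male', 'Female'), 100, replace = TRUE),
  Region = sample(c('Midwest', 'Northeast', 'South', 'West'), 100, replace = TRUE),
  stringsAsFactors = FALSE
)
parts <- balanced_split(demo, c("Gender", "Region"), k = 4)

sapply(parts, nrow)                            # size of each subset
lapply(parts, function(p) table(p$Gender))     # gender counts per subset
```

With all four columns and only 100 rows, most combinations would contain a single respondent, so I would expect this to balance the subsets only approximately.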
There is a second part to my problem. R has many built-in functions whose names make it difficult to describe a problem like this unambiguously. Split, data, factors, values, and subset are all words that we might use interchangeably when talking about data in general, but they do not yield the right results when typed into Google or Stack Overflow. I would like to know if there is a more technically precise term I should use to describe my problem.