I'm a university student, beginning to explore R for an exam. Sorry for the vague title, as I have many questions related to this post.
I've run into the problem of sampling a population of people who are either Male (M) or Female (F). I wanted to define a function that takes the number of Males and Females in this population, draws sample.number samples of size sample.size, and returns a data frame containing the sample proportions of females (females over the total sample size) together with their frequencies.
I'm positive there is a simple and well-optimized way to do this, but I've written a small function that (barely) works:
senators <- function(Fem = 13,
                     Mal = 87,
                     sample.size = 10,
                     sample.number = 100){
  pop <- c(rep("F", Fem), rep("M", Mal)) # I create the population base
  popsa <- list(NA) # I make some empty variables used later
  popsa.factor <- list(NA) # Not sure if this passage is even needed...
  popsa.proportion <- list(NA)
Here comes a for loop. I've read that for loops are a really inefficient way to do this. Is there a better way?
  for(i in 1:sample.number){
    popsa[[i]] <- sample(pop, sample.size, replace = TRUE)
    popsa.factor[[i]] <- table(factor(popsa[[i]], levels = c("M", "F")))
    popsa.proportion[[i]] <- popsa.factor[[i]][2]/sample.size
  }
I start by assigning a sample to each element of the list popsa, then I use popsa to create a table from each sample and store it in popsa.factor. Then I calculate the proportion of females over the total and store it in popsa.proportion. This for loop seems super messy to me, and it is really slow when processing lots of samples. Is there a better, more efficient way to do what I've done here?
  popsa.unlisted <- unlist(popsa.proportion)
  popsa.frequency <- table(popsa.unlisted)
  popsa.frame <- data.frame(Level = as.numeric(names(popsa.frequency)),
                            Freq = as.numeric(popsa.frequency))
  return(popsa.frame)
} # This closes the function call
I then unlist popsa.proportion to get every proportion in a vector, and table those values to get the frequencies, storing them in popsa.frequency. Now I try to turn the factor popsa.frequency into a data frame by cheating: I convert the names of popsa.frequency to numeric and store them as the first column of the data frame. The function then returns popsa.frame, as I wanted.
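For comparison, here is a minimal loop-free sketch of the same idea. It swaps in a different technique from the loop above: because the samples are drawn with replace = TRUE, the number of females in each sample follows a binomial distribution, so rbinom() can generate all the counts in one call. The name senators_vec is hypothetical and only used for illustration.

```r
# Hypothetical loop-free variant; senators_vec is an illustrative name,
# not part of the original function.
senators_vec <- function(Fem = 13, Mal = 87,
                         sample.size = 10, sample.number = 100) {
  # Sampling with replacement: each draw is "F" with this probability
  p.female <- Fem / (Fem + Mal)
  # Number of females in each of the sample.number samples
  n.female <- rbinom(sample.number, size = sample.size, prob = p.female)
  # Tabulate the integer counts, then convert the labels to proportions
  counts <- table(n.female)
  data.frame(Level = as.numeric(names(counts)) / sample.size,
             Freq  = as.vector(counts))
}

set.seed(1)  # only to make the example reproducible
res <- senators_vec(sample.size = 10, sample.number = 1000)
res
```

Tabulating the integer counts first and dividing afterwards avoids relying on how fractional values are turned into table labels.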
popsa.frame, though, still carries over the factor properties of popsa.frequency in its first column (Level). How can I change this? Should I?
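One way to check whether a column really is a factor is to look at its class. The sketch below uses a small made-up vector of proportions (props, standing in for popsa.unlisted) and compares a data frame built with as.data.frame() on the table against one built by hand with as.numeric(names(...)), as in the function above.

```r
# Made-up proportions standing in for popsa.unlisted
props <- c(0.1, 0.2, 0.1, 0.3, 0.1)
freq <- table(props)

# Converting the table directly gives a factor in the first column
auto.frame <- as.data.frame(freq)
class(auto.frame$props)     # "factor"

# Building the columns by hand with as.numeric() gives a numeric column
manual.frame <- data.frame(Level = as.numeric(names(freq)),
                           Freq  = as.numeric(freq))
class(manual.frame$Level)   # "numeric"

# If a column does turn out to be a factor, go through character first
# so the labels are converted rather than the internal integer codes
level.numeric <- as.numeric(as.character(auto.frame$props))
```

str(manual.frame) shows the same information for every column at once.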
Since these are frequencies of a sampling distribution, I'd like to create a histogram from this data frame, but hist() only accepts numeric vectors, so popsa.frame isn't a valid object. plot(popsa.frame) returns more or less what I want, though. How can I create such a histogram?
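Since the frequencies are already tabulated, one possible approach is to draw them directly with barplot() instead of hist(). The sketch below uses made-up values in the same shape as popsa.frame; example.frame and its numbers are assumptions for illustration only.

```r
# Made-up data frame with the same columns as popsa.frame
example.frame <- data.frame(Level = c(0, 0.1, 0.2, 0.3),
                            Freq  = c(25, 40, 25, 10))

# One bar per distinct proportion, with bar height equal to its frequency
mids <- barplot(height    = example.frame$Freq,
                names.arg = example.frame$Level,
                xlab      = "Proportion of females in sample",
                ylab      = "Frequency")
```

barplot() returns the x positions of the bar midpoints (stored in mids here), which can be handy for adding labels afterwards.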
Edit: Following the marked answer below, I've also worked out how to convert the data frame the function creates into an object that hist() can actually use to create a frequency histogram (although a barplot yields more or less the same graph, and is possibly a more statistically correct way to show such a result):
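result <- senators(Fem = 13, Mal = 87, sample.size = 50, sample.number = 10000)
# Repeat each proportion by its frequency to rebuild the raw vector of sample proportions
# (no sapply() needed; wrapping rep() in it would duplicate the vector once per level)
raw <- rep(result$Level, times = result$Freq)
hist(raw)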