
I'm a university student, beginning to explore R for an exam. Sorry for the vague title, as I have many questions related to this post.

I've run into the problem of sampling a population of people who are either Male (M) or Female (F). I wished to define a function that could take the number of Males and Females in this population, then create sample.number samples of size sample.size and return a data frame containing the sample proportions of females over the total size of the sample, with related frequencies.

I'm positive there is a simple and well-optimized way to do this, but I've written a small function that (barely) works:

senators <- function(Fem = 13, 
                 Mal = 87, 
                 sample.size = 10, 
                 sample.number = 100){

pop <- c(rep("F", Fem), rep("M", Mal)) # I create the population base

popsa <- list(NA)           # I make some empty variables used later
popsa.factor <- list(NA)    # Not sure if this passage is even needed...
popsa.proportion <- list(NA)

Here comes a for loop. I've read that for loops are a really inefficient way to do this. Is there a better way?

for(i in 1:sample.number){
  popsa[[i]] <- sample(pop, sample.size, replace = TRUE)
  popsa.factor[[i]] <- table(factor(popsa[[i]], levels = c("M", "F")))
  popsa.proportion[[i]] <- popsa.factor[[i]][2]/sample.size
  }

I start by assigning each element of the list popsa with a sample, then I use popsa to create a table from each sample, and store it in popsa.factor. Then I calculate the proportions of females over the total and store it in popsa.proportion. This for loop seems super messy to me, and is really slow to process lots of samples. Is there a better, more efficient way to do what I've done here?

popsa.unlisted <- unlist(popsa.proportion)
popsa.frequency <- table(popsa.unlisted)

popsa.frame <- data.frame(Level = as.numeric(names(popsa.frequency)), 
                          Freq =  as.numeric(popsa.frequency))
return(popsa.frame)
} # This closes the function call

I then unlist popsa.proportion to get every proportion in a vector, and table those values to get the frequencies, storing them into popsa.frequency. Now I try to turn the factor popsa.frequency into a data frame, by cheating and converting the names of popsa.frequency as numeric and storing them as the first column of the data frame. The function then returns popsa.frame, as I wanted.

popsa.frame, though, still carries over the factor properties of popsa.frequency in its first column (Level). How can I change this? Should I?

Since these are frequencies of a sample distribution, I'd like to create a histogram from this data frame, but hist() only accepts numeric vectors, so popsa.frame isn't a valid object. plot(popsa.frame) returns more or less what I want, though. How can I create such a histogram?

Edit: Following the marked answer below, I've also worked out how to convert the data frame the function creates into an object that hist() can use to create a frequency histogram (although a barplot yields more or less the same graph, and is possibly a more statistically correct way to show such a result):

result <- senators(Fem=13,Mal=87,sample.size=50,sample.number=10000)

# Repeat each proportion by its frequency to rebuild the raw numeric vector
raw <- rep(result$Level, times = result$Freq)

hist(raw)
  • So, you want to do histograms for each column of your `data.frame`? – patL Dec 27 '17 at 15:22
  • Not exactly, I wish to create a single histogram, where in the "y" axis are the frequencies, and in the "x" axis are the proportion values. @patL Something like [This](https://i.imgur.com/pgSRKX9.png), but with the columns of an histogram. – Luca Visentin Dec 27 '17 at 15:29

2 Answers

Your function has default values, so calling senators() with no arguments already creates a data.frame.

Following your data I would do:

df <- senators() # using default values
plot(df, type = "h", lwd = 5, lend = 1) # type sets the plot type, lwd sets the line width, and lend = 1 gives the bars square ends

Take a look at ?plot to see the types of plots you can do. Also, you can see how to change graphical parameters with ?par.

P.S.: look at this post for line width details.
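
As a rough sketch of an alternative (not something the original answer used), barplot() can draw the same kind of table with one bar per proportion. The data frame below is a made-up stand-in with the same Level and Freq columns that senators() returns; the counts are purely illustrative.

```r
# Hypothetical stand-in for the output of senators(): same columns, made-up counts
df <- data.frame(Level = c(0, 0.1, 0.2, 0.3, 0.4),
                 Freq  = c(25, 38, 24, 10, 3))

# One bar per observed proportion; names.arg labels each bar with its proportion.
# barplot() returns the x positions of the bar midpoints.
mids <- barplot(df$Freq, names.arg = df$Level,
                xlab = "Proportion of females", ylab = "Frequency")
```

Unlike type = "h", the bar widths here do not depend on lwd, so there are no line widths to tune.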

patL

The creation of the lists and the for loop has some performance bottlenecks. I was able to use sapply to remove the for loop and some of the temporary variables.

I am still returning the data frame. Another option would be to return the vector answer instead and pass it straight to the histogram plotting function for your final plot.

senators <- function(Fem = 13, 
                     Mal = 87, 
                     sample.size = 10, 
                     sample.number = 100){

  pop <- c(rep("F", Fem), rep("M", Mal)) # I create the population base

  # For each sample, draw with replacement and compute the proportion of "F"
  answer <- sapply(1:sample.number, function(x) {
    popsa <- sample(pop, sample.size, replace = TRUE)
    length(popsa[popsa == "F"]) / sample.size
  })

  popsa.frequency <- table(answer)

  popsa.frame <- data.frame(Level = as.numeric(names(popsa.frequency)),
                            Freq  = as.numeric(popsa.frequency))
  return(popsa.frame)
} 

senators()   
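
The "return the vector" option mentioned above could look roughly like the sketch below. It also swaps in a different technique from the sapply version: because the sampling is done with replacement, the number of females in each sample follows a binomial distribution, so rbinom() can generate all the counts in one call. The name senators_raw is hypothetical, and set.seed() is there only to make the sketch reproducible.

```r
# Hypothetical variant of senators() that returns the raw sample proportions
senators_raw <- function(Fem = 13, Mal = 87,
                         sample.size = 10, sample.number = 100) {
  # With replacement, each draw is "F" with probability Fem / (Fem + Mal),
  # so the count of females per sample is Binomial(sample.size, p.female)
  p.female <- Fem / (Fem + Mal)
  rbinom(sample.number, size = sample.size, prob = p.female) / sample.size
}

set.seed(1)  # only to make the sketch reproducible
props <- senators_raw(sample.size = 50, sample.number = 10000)

hist(props)                          # histogram straight from the numeric vector
freq <- as.data.frame(table(props))  # frequency table, if it is still needed
```

This relies on replace = TRUE in the original sampling; without replacement the counts would be hypergeometric instead, and rhyper() would be the analogous call.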
Dave2e