I am trying to summarise data from a household survey, and most of my data are categorical (factor) variables. I would like to summarise them with plots of response frequencies (e.g., a bar plot of the percentage of households giving each answer to a question, with error bars showing confidence intervals). I found this excellent tutorial, which I thought was the answer to my prayers (http://www.cookbook-r.com/Manipulating_data/Summarizing_data/), but it turns out it only helps with continuous data.
What I need is something similar that will allow me to calculate proportions of counts and standard errors / confidence intervals of these proportions.
Essentially I want to be able to produce summary tables that look like this for each of the questions asked in my survey data:
# X5employf  X5employff  N (count)  proportion  SE of prop.  CI of prop.
# 1          1           20         0.64516129  ?            ?
# 1          2           1          0.03225806  ?            ?
# 1          3           9          0.29032258  ?            ?
# 1          NA          1          0.03225806  ?            ?
# 2          4           1          0.1         ?            ?
Here are my count data (output of dput(), assigned to the name used in the code further down):
X5employ_counts <- structure(list(X5employf = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L), .Label = c("1", "2", "3"), class = "factor"), X5employff = structure(c(1L, 2L, 3L, NA, 4L, 5L, 6L, 7L, 8L, 4L, 5L, 6L, 7L), .Label = c("1", "2", "3", "4", "5", "6", "7", "8"), class = "factor"), count = c(20L, 1L, 9L, 1L, 1L, 5L, 2L, 1L, 1L, 4L, 5L, 4L, 1L)), .Names = c("X5employf", "X5employff", "count"), row.names = c(NA, -13L), class = "data.frame")
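For reference, the "?" columns in the table above could be filled in along the following lines. This is only a sketch in base R: `add_prop_ci` and the column names `total`, `se`, `ci_lower` and `ci_upper` are made up for illustration, and it assumes a simple normal-approximation (Wald) interval for each proportion, ignoring any survey weights or design effects.

```r
# Counts rebuilt by hand from the dput() output in the question
X5employ_counts <- data.frame(
  X5employf  = factor(c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3)),
  X5employff = factor(c(1, 2, 3, NA, 4, 5, 6, 7, 8, 4, 5, 6, 7), levels = 1:8),
  count      = c(20L, 1L, 9L, 1L, 1L, 5L, 2L, 1L, 1L, 4L, 5L, 4L, 1L)
)

# Hypothetical helper: proportion within `group`, its standard error,
# and a Wald confidence interval clipped to [0, 1]
add_prop_ci <- function(df, group, conf = 0.95) {
  total <- ave(df$count, df[[group]], FUN = sum)  # group total on every row
  z <- qnorm(1 - (1 - conf) / 2)                  # about 1.96 for 95%
  df$total    <- total
  df$prop     <- df$count / total
  df$se       <- sqrt(df$prop * (1 - df$prop) / total)
  df$ci_lower <- pmax(0, df$prop - z * df$se)
  df$ci_upper <- pmin(1, df$prop + z * df$se)
  df
}

X5employ_summary <- add_prop_ci(X5employ_counts, "X5employf")
print(X5employ_summary)
```

The Wald interval tends to behave poorly for small counts or proportions near 0 or 1, which several of these cells are, so an exact or score interval (for example via `binom.test()` or `prop.test()` from the stats package) may be a better choice in practice.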
I would then like to draw bar plots in ggplot2 (or similar) from these summary tables, with error bars showing the confidence intervals.
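To make the target plot concrete, here is a rough sketch of a bar plot with error bars. It assumes the ggplot2 package is installed, uses only the four counts for the first X5employf group (20, 1, 9, 1), and computes a 95% normal-approximation interval inline; `plot_df` and its column names are placeholders, not anything from my real data.

```r
library(ggplot2)  # assumed to be installed

# Toy summary for one group: counts 20, 1, 9, 1 out of 31
plot_df <- data.frame(
  response = c("1", "2", "3", "NA"),
  count    = c(20, 1, 9, 1)
)
total <- sum(plot_df$count)
plot_df$prop     <- plot_df$count / total
plot_df$se       <- sqrt(plot_df$prop * (1 - plot_df$prop) / total)
plot_df$ci_lower <- pmax(0, plot_df$prop - 1.96 * plot_df$se)
plot_df$ci_upper <- pmin(1, plot_df$prop + 1.96 * plot_df$se)

# Bars show the proportions, error bars show the interval limits
p <- ggplot(plot_df, aes(x = response, y = prop)) +
  geom_col(fill = "grey70") +
  geom_errorbar(aes(ymin = ci_lower, ymax = ci_upper), width = 0.2) +
  scale_y_continuous(labels = function(x) paste0(round(100 * x), "%")) +
  labs(x = "X5employff", y = "Proportion of households")
print(p)
```

For the full table, adding something like `facet_wrap(~ X5employf)` would give one panel per group.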
I had thought to amend the code provided in the tutorial above to calculate the columns above, though as a relative newcomer to R I am struggling a little! I have been experimenting with the plyr package, but I am not great with the syntax, so this is as far as I have got:
X5employ_props <- ddply(X5employ_counts, .(X5employf), transform, prop=count/sum(count))
But I end up with this:
   X5employf X5employff count      prop
1          1          1    20 1.0000000
2          1          2     1 1.0000000
3          1          3     9 1.0000000
4          2          4     1 0.2000000
5          3          4     4 0.8000000
6          2          5     5 0.5000000
7          3          5     5 0.5000000
8          2          6     2 0.3333333
9          3          6     4 0.6666667
10         2          7     1 0.5000000
11         3          7     1 0.5000000
12         2          8     1 1.0000000
13         1       <NA>     1 1.0000000
Many of my proportions come out as 1, presumably because they are being calculated across rows and not down columns.
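One way to check which grouping produced those numbers is to compute the proportions under both candidate groupings and compare. The sketch below uses base R's `ave()` as a stand-in for the ddply call; `d`, `ff_key`, `prop_by_f` and `prop_by_ff` are names invented for this example. On these counts, the values in the output above (e.g. 0.2 and 0.8 for X5employff = 4) match the split by X5employff rather than by X5employf.

```r
# Same counts as in the question, rebuilt by hand
d <- data.frame(
  X5employf  = factor(c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3)),
  X5employff = factor(c(1, 2, 3, NA, 4, 5, 6, 7, 8, 4, 5, 6, 7), levels = 1:8),
  count      = c(20, 1, 9, 1, 1, 5, 2, 1, 1, 4, 5, 4, 1)
)

# Proportion within each X5employf group (the layout of the desired table)
d$prop_by_f <- d$count / ave(d$count, d$X5employf, FUN = sum)

# Proportion within each X5employff value; the NA response gets its own key
ff_key <- ifelse(is.na(d$X5employff), "missing", as.character(d$X5employff))
d$prop_by_ff <- d$count / ave(d$count, ff_key, FUN = sum)

print(d)
```

If `prop_by_f` is the column wanted, the grouping variable passed to the summarising call needs to be X5employf, so it is worth double-checking which variable was actually used when the output was generated.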
I wondered if anyone could help or knows of packages / code that would do the job for me!