
I am trying to summarise data from a household survey, and as such most of my data is categorical (factor) data. I want to summarise it with plots of the frequencies of responses to certain questions (e.g., a bar plot of the percentage of households giving each answer, with error bars showing confidence intervals). I found this excellent tutorial, which I had thought was the answer to my prayers (http://www.cookbook-r.com/Manipulating_data/Summarizing_data/), but it turns out it only helps with continuous data.

What I need is something similar that will allow me to calculate proportions of counts and standard errors / confidence intervals of these proportions.

Essentially I want to be able to produce summary tables that look like this for each of the questions asked in my survey data:

# X5employf X5employff  N(count)  proportion  SE of prop.  ci of prop
#     1          1         20     0.64516129       ?            ?
#     1          2          1     0.03225806       ?            ?
#     1          3          9     0.29032258       ?            ?
#     1          NA         1     0.03225806       ?            ?
#     2          4          1     0.10000000       ?            ?
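(To illustrate what I mean by the SE of a proportion: for the first row, a simple Wald-type standard error would be `sqrt(p * (1 - p) / n) = sqrt(0.64516129 * 0.35483871 / 31)`, roughly 0.086, though I am open to other methods of calculating it.)

Here is my data (output from `dput`):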


structure(list(X5employf = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L), .Label = c("1", "2", "3"), class = "factor"), X5employff = structure(c(1L, 2L, 3L, NA, 4L, 5L, 6L, 7L, 8L, 4L, 5L, 6L, 7L), .Label = c("1", "2", "3", "4", "5", "6", "7", "8"), class = "factor"), count = c(20L, 1L, 9L, 1L, 1L, 5L, 2L, 1L, 1L, 4L, 5L, 4L, 1L)), .Names = c("X5employf", "X5employff", "count"), row.names = c(NA, -13L), class = "data.frame")

I would then want to plot bar plots in ggplot (or similar) using these summary data, with error bars showing the confidence intervals.

I had thought to amend the code provided in the tutorial above to calculate the columns above, but as a relative newcomer to R I am struggling a little! I have been experimenting with `ddply` from the plyr package, but I am not so great on the syntax, so I have only managed to get as far as the following code:

> X5employ_props <- ddply(X5employ_counts, .(X5employff), transform, prop=count/sum(count))

But I end up with this:

   X5employf X5employff count      prop
1          1          1    20 1.0000000
2          1          2     1 1.0000000
3          1          3     9 1.0000000
4          2          4     1 0.2000000
5          3          4     4 0.8000000
6          2          5     5 0.5000000
7          3          5     5 0.5000000
8          2          6     2 0.3333333
9          3          6     4 0.6666667
10         2          7     1 0.5000000
11         3          7     1 0.5000000
12         2          8     1 1.0000000
13         1       <NA>     1 1.0000000

With many of my proportions coming out as 1, presumably because I am grouping by the wrong variable, so the proportions are being calculated within the wrong sets of rows.
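
For the proportions themselves, I suspect I just need to group by the main question instead (a sketch; `X5employ_counts` is the data frame given by the `structure()` call above):

library(plyr)

# group by the main question, so each proportion is count / group total
X5employ_props <- ddply(X5employ_counts, .(X5employf), transform,
                        prop = count / sum(count))
head(X5employ_props, 4)
#   X5employf X5employff count       prop
# 1         1          1    20 0.64516129
# 2         1          2     1 0.03225806
# 3         1          3     9 0.29032258
# 4         1       <NA>     1 0.03225806

That seems to get the proportion column right, but I still cannot work out how to get the standard error and confidence interval columns.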

I wonder if anyone can help, or knows of packages / code that would do the job for me!

marty_c
  • Are you aware of http://docs.ggplot2.org/current/geom_errorbar.html? You can plot a barplot with a `stat = "identity"` argument; see http://docs.ggplot2.org/current/geom_bar.html for further details. To get a better response, I suggest you provide us with some reproducible data. – Roman Luštrik Jul 23 '13 at 07:10
  • Hi Roman, yes I've read the ggplot2 documentation on geom_errorbar, and have produced my bar plots already. However, geom_errorbar requires you to specify limits for plotting error bars; that's why I'm trying to summarise my data first. Ideally, I'm looking for a way to automate this as I have 49 variables. – marty_c Jul 23 '13 at 08:02
  • Here are the first three vectors. integer: `1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55`; factor1: `1 3 1 1 1 3 1 1 1 3 1 1 1 2 2 3 3 3 1 2 2 2 2 2 1 1 1 3 3 3 3 3 3 2 1 1 3 1 3 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2`; factor2: `1 4 1 2 4 3 1 1 6 1 1 1 5 5 6 7 5 1 6 6 7 5 4 1 3 1 6 5 5 5 6 4 5 3 3 5 1 4 5 1 1 1 1 1 3 3 3 1 3 1 1 1 3 8` – marty_c Jul 23 '13 at 08:14
  • See http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example on how to dish out handy data. – Roman Luštrik Jul 23 '13 at 10:13
  • Oops sorry, my bad. How about this:`structure(list(X5employf = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L), .Label = c("1", "2", "3"), class = "factor"), X5employff = structure(c(1L, 2L, 3L, NA, 4L, 5L, 6L, 7L, 8L, 4L, 5L, 6L, 7L), .Label = c("1", "2", "3", "4", "5", "6", "7", "8"), class = "factor"), count = c(20L, 1L, 9L, 1L, 1L, 5L, 2L, 1L, 1L, 4L, 5L, 4L, 1L)), .Names = c("X5employf", "X5employff", "count"), row.names = c(NA, -13L), class = "data.frame")` – marty_c Jul 23 '13 at 17:18
  • Did you have a [particular method](http://www.nucmed.com/documents/Academic/Grants/GBM_SVM_BN/Articles/Two-Sided%20Confidence%20Intervals%20for%20the%20Single%20Proportion.pdf) in mind? – dardisco Jul 23 '13 at 20:37
  • Using `binom.exact` from `epitools`, I've used `bsum <- ddply(bb,.(ttt),function(x) { n <- nrow(x) b <- binom.exact(sum(x$predation),n=n)[,c("n","proportion","lower","upper")] as.data.frame(rename(b,c(proportion="Mean",lower="Lower",upper="Upper"))) }) ` before ... – Ben Bolker Jul 24 '13 at 21:34

2 Answers


There are numerous methods for calculating binomial confidence intervals, and I doubt there is much consensus on which method is best. That said, here is one approach that computes the intervals using several different methods at once. I am not sure whether this helps.

library(binom)

x <- c(3, 4, 5, 6, 7)    # number of successes in each group
n <- rep(10, length(x))  # number of trials in each group

binom.confint(x, n, conf.level = 0.95, methods = "all")

          method x  n      mean      lower     upper
1  agresti-coull 3 10 0.3000000 0.10333842 0.6076747
2  agresti-coull 4 10 0.4000000 0.16711063 0.6883959
3  agresti-coull 5 10 0.5000000 0.23659309 0.7634069
4  agresti-coull 6 10 0.6000000 0.31160407 0.8328894
5  agresti-coull 7 10 0.7000000 0.39232530 0.8966616
6     asymptotic 3 10 0.3000000 0.01597423 0.5840258
7     asymptotic 4 10 0.4000000 0.09636369 0.7036363
8     asymptotic 5 10 0.5000000 0.19010248 0.8098975
9     asymptotic 6 10 0.6000000 0.29636369 0.9036363
10    asymptotic 7 10 0.7000000 0.41597423 0.9840258
11         bayes 3 10 0.3181818 0.09269460 0.6058183
12         bayes 4 10 0.4090909 0.15306710 0.6963205
13         bayes 5 10 0.5000000 0.22352867 0.7764713
14         bayes 6 10 0.5909091 0.30367949 0.8469329
15         bayes 7 10 0.6818182 0.39418168 0.9073054
16       cloglog 3 10 0.3000000 0.07113449 0.5778673
17       cloglog 4 10 0.4000000 0.12269317 0.6702046
18       cloglog 5 10 0.5000000 0.18360559 0.7531741
19       cloglog 6 10 0.6000000 0.25266890 0.8272210
20       cloglog 7 10 0.7000000 0.32871659 0.8919490
21         exact 3 10 0.3000000 0.06673951 0.6524529
22         exact 4 10 0.4000000 0.12155226 0.7376219
23         exact 5 10 0.5000000 0.18708603 0.8129140
24         exact 6 10 0.6000000 0.26237808 0.8784477
25         exact 7 10 0.7000000 0.34754715 0.9332605
26         logit 3 10 0.3000000 0.09976832 0.6236819
27         logit 4 10 0.4000000 0.15834201 0.7025951
28         logit 5 10 0.5000000 0.22450735 0.7754927
29         logit 6 10 0.6000000 0.29740491 0.8416580
30         logit 7 10 0.7000000 0.37631807 0.9002317
31        probit 3 10 0.3000000 0.08991347 0.6150429
32        probit 4 10 0.4000000 0.14933907 0.7028372
33        probit 5 10 0.5000000 0.21863901 0.7813610
34        probit 6 10 0.6000000 0.29716285 0.8506609
35        probit 7 10 0.7000000 0.38495714 0.9100865
36       profile 3 10 0.3000000 0.08470272 0.6065091
37       profile 4 10 0.4000000 0.14570633 0.6999845
38       profile 5 10 0.5000000 0.21765974 0.7823403
39       profile 6 10 0.6000000 0.30001552 0.8542937
40       profile 7 10 0.7000000 0.39349089 0.9152973
41           lrt 3 10 0.3000000 0.08458545 0.6065389
42           lrt 4 10 0.4000000 0.14564246 0.7000216
43           lrt 5 10 0.5000000 0.21762124 0.7823788
44           lrt 6 10 0.6000000 0.29997837 0.8543575
45           lrt 7 10 0.7000000 0.39346107 0.9154146
46     prop.test 3 10 0.3000000 0.08094782 0.6463293
47     prop.test 4 10 0.4000000 0.13693056 0.7263303
48     prop.test 5 10 0.5000000 0.20142297 0.7985770
49     prop.test 6 10 0.6000000 0.27366969 0.8630694
50     prop.test 7 10 0.7000000 0.35367072 0.9190522
51        wilson 3 10 0.3000000 0.10779127 0.6032219
52        wilson 4 10 0.4000000 0.16818033 0.6873262
53        wilson 5 10 0.5000000 0.23659309 0.7634069
54        wilson 6 10 0.6000000 0.31267377 0.8318197
55        wilson 7 10 0.7000000 0.39677815 0.8922087
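
To apply this group-by-group to your data frame, something along the following lines might work (a sketch I have not polished; `dat` stands for the data frame from your `structure()` call, I picked the Wilson method arbitrarily, and the Wald-style SE column is my own addition):

library(plyr)
library(binom)

props <- ddply(dat, .(X5employf), function(d) {
  n  <- sum(d$count)                                   # total count for this X5employf group
  ci <- binom.confint(d$count, n, methods = "wilson")  # one interval per X5employff level
  data.frame(X5employff = d$X5employff,
             count      = d$count,
             proportion = ci$mean,                     # count / n
             SE         = sqrt(ci$mean * (1 - ci$mean) / n),  # simple Wald SE
             lower      = ci$lower,
             upper      = ci$upper)
})
props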

I am not entirely sure what you want, but here is code to create a table that I think contains all of the parameters you are after. I dug the code out of the binom package; it uses the Agresti-Coull method.

conf.level <- 0.95

x <- c( 4, 5, 6)      # successes
n <- c(10, 10, 10)    # trials

method <- 'ac'

# source code from package binom:

xn <- data.frame(x = x, n = n)
all.methods <- any(method == "all")
p <- x/n
alpha <- 1 - conf.level
alpha <- rep(alpha, length = length(p))
alpha2 <- 0.5 * alpha
z <- qnorm(1 - alpha2)
z2 <- z * z
res <- NULL
if (any(method %in% c("agresti-coull", "ac")) || all.methods) {
  .x <- x + 0.5 * z2   # adjusted number of successes
  .n <- n + z2         # adjusted number of trials
  .p <- .x/.n          # adjusted proportion
  lcl <- .p - z * sqrt(.p * (1 - .p)/.n)
  ucl <- .p + z * sqrt(.p * (1 - .p)/.n)
  res.ac <- data.frame(method = rep("agresti-coull", NROW(x)),
                       xn, mean = p, lower = lcl, upper = ucl)
  res <- res.ac
}

SE <- sqrt(.p * (1 - .p)/.n)  # standard error of the adjusted proportion
SE

See also: http://www.stat.sc.edu/~hendrixl/stat205/Lecture%20Notes/Confidence%20Interval%20for%20the%20Population%20Proportion.pdf

Here is the table containing all data and parameters.

my.table <- data.frame(res, SE)
my.table

         method x  n mean     lower     upper        SE
1 agresti-coull 4 10  0.4 0.1671106 0.6883959 0.1329834
2 agresti-coull 5 10  0.5 0.2365931 0.7634069 0.1343937
3 agresti-coull 6 10  0.6 0.3116041 0.8328894 0.1329834

I have not yet checked whether these estimates match any examples in Agresti's books. However, the first R function below, from the University of Florida, returns the same CI estimates as the binom package; the second one does not.

http://www.stat.ufl.edu/~aa/cda/R/one-sample/R1/

x <- 4
n <- 10
conflev <- 0.95

# Computes the Agresti-Coull CI for x successes out of n trials
# with confidence coefficient conflev.
addz2ci <- function(x, n, conflev){
   z = abs(qnorm((1-conflev)/2))
   tr = z^2     # the number of trials added
   suc = tr/2   # the number of successes added
   ptilde = (x+suc)/(n+tr)
   stderr = sqrt(ptilde * (1-ptilde)/(n+tr))
   ul = ptilde + z * stderr
   ll = ptilde - z * stderr
   if(ll < 0) ll = 0
   if(ul > 1) ul = 1
   c(ll, ul)
}

# Computes the Agresti-Coull 'add 4' CI for x successes out of n trials
# with confidence coefficient conflev. Adds 2 successes and 4 trials.
add4ci <- function(x, n, conflev){
   ptilde = (x+2)/(n+4)
   z = abs(qnorm((1-conflev)/2))
   stderr = sqrt(ptilde * (1-ptilde)/(n+4))
   ul = ptilde + z * stderr
   ll = ptilde - z * stderr
   if(ll < 0) ll = 0
   if(ul > 1) ul = 1
   c(ll, ul)
}
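
As a quick check (my own numbers, rounded):

addz2ci(4, 10, 0.95)  # 0.1671106 0.6883959 -- matches binom's agresti-coull row for x = 4
add4ci(4, 10, 0.95)   # 0.1693464 0.6877964 -- close, but not identical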

Note also that according to the first link above the Agresti-Coull interval is not recommended when n < 40.

As for the other packages you mentioned, I rarely use them, but I am pretty sure you could include the above code in an R script that calls those packages.

Mark Miller
  • Hi Mark, thanks, that's nearly what I want. But am I able to combine elements of the binom.confint output with ddply (or similar)? I'm hoping to automate the creation of tables that contain all of these values plus count, standard error etc., which I can then plot in ggplot... I'm trying to get to a point where I can churn out these plots for any variable/set of variables in my data table quickly and with 'pretty' results....:) – marty_c Jul 24 '13 at 16:54
  • Hi Mark, thanks again for your speedy response, but I'm not sure it's quite what I need. First and foremost I need the proportions of each factor as per my original example at the top of the page, and from this the standard errors/confidence intervals for each variable. E.g. from my example at the top: the proportion of females taking aspirin out of all female patients; the proportion of female patients taking a placebo out of all female patients; the proportion of males taking aspirin out of all males; etc. Your example gives a mean, and I can't work out how you arrive at a mean from a proportion?? – marty_c Jul 27 '13 at 06:36
  • @marty_c I am not sure I know how to help further because I do not know where the proportions in your table come from: the 0.9979145, 0.5247655, 1.0674848, and 0.5291503. If you edit your post to show exactly how you get those four numbers maybe I can help more. If those four proportions are unrelated to your example data set then perhaps show what the four proportions are that come from your example data set and perhaps show exactly how to get those four proportions 'by hand'. – Mark Miller Jul 27 '13 at 07:02
  • Mark, I have now edited my original table at the top of the post with hand-calculated values of the proportions. It is essentially the proportion of observations of X5employff grouped by X5employf, e.g. the first line is 20/31 (where 31 is the total number of observations in group X5employf = 1). Thanks – marty_c Jul 27 '13 at 07:28

Here is a method for estimating multinomial 95% confidence intervals.

library(MultinomialCI)

x <- c(20, 1, 9, 1)  # counts for the X5employf = 1 group from your data

multinomialCI(x, alpha = 0.05, verbose = FALSE)

#           [,1]      [,2]
# [1,] 0.5161290 0.8322532
# [2,] 0.0000000 0.2193499
# [3,] 0.1612903 0.4774145
# [4,] 0.0000000 0.2193499
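
Note that the point estimates are just the counts divided by the group total, which matches the proportion column in your table:

x/sum(x)
# [1] 0.64516129 0.03225806 0.29032258 0.03225806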

I have not yet looked at the source code to see how to obtain the SEs.

Mark Miller