3

I'm attempting to use ggplot and R for analysing some epidemiologic data, and I'm continuing to struggle with getting an epidemic curve to appear properly.

Data is here

attach(epicurve)
head(epicurve)

     onset age
    1 21/12/2012  18
    2 14/06/2013   8
    3 10/06/2013  64
    4 28/05/2013  79
    5 14/04/2013  56
    6  9/04/2013  66

epicurve$onset <- as.Date(epicurve$onset, format="%d/%m/%Y")
ggplot(epicurve, aes(onset)) + geom_histogram() + scale_x_date(breaks=date_breaks("1 year"), minor_breaks=date_breaks("1 month"), labels = date_format("%b-%Y"))

gives this graph. This is fine, but the binwidths are not related to any time period of note, and adjusting them is a bit trial and error.

For this particular dataset, I'd like to display the cases by month of onset.

One way I worked out how to do this is:

epicurve$monyr <- format(epicurve$onset, "%b-%Y")
epicurve$monyr <- as.factor(epicurve$monyr)
ggplot(epicurve, aes(monyr)) + geom_histogram()

Outputs a graph I can't post because of the reputation system. The bars represent something meaningful, but the axis labels are a bomb-site. I can't format the axes using scale_x_date because they aren't dates and I can't work out what arguments to pass to scale_x_discrete to give useful labels.

I have a feeling there should be an easier way to do this by doing an operation on the onset column. Can anyone give me any pointers, please?

Trent
  • 55
  • 1
  • 8
  • 1
    See: http://stackoverflow.com/questions/10770698/understanding-dates-and-plotting-a-histogram-with-ggplot2-in-r for some helpful examples. – Nate Pope Sep 18 '13 at 02:01
  • @NatePope - the issue with Gauden's approach is still that the bins are still arbitrary and the ticks are aligned to fit the bins, rather than the bins representing a calendar month and being centered appropriately over the ticks. – Trent Sep 18 '13 at 02:17
  • 2
    perhaps I'm not clear on what you're after. Gauden's updated answer has explicitly defined bin widths - 2 months - and labels formatted to 'year-month'. It seems easy enough to modify this to have breaks at every calender month. – Nate Pope Sep 18 '13 at 02:32
  • @NatePope i was just going through the OP's Gauden solution and i'm getting a `non-numeric argument to binary operator` when I try to add the `scale_x_date` part. This is despite me running `epicurve$onset <- as.Date(epicurve$onset)` before the rest of the code. I've had this happen before - can anyone explain why? – Trent Sep 18 '13 at 11:52

2 Answers2

1

One option is to aggregate the data outside ggplot and then use geom_bar. This will produce counts by month.

edited Sept. 21 2013. Altered plot to show months with no counts.

epicurve <- read.csv("epicurve.csv", sep=",", header=T)

# initial formatting
epicurve$onset <- as.Date(epicurve$onset, format="%d/%m/%Y")    # convert to Date class
epicurve$onset <- strftime(epicurve$onset, format="%Y/%m")      # convert to Year-month
epicurve$onset <- paste(epicurve$onset, "/01", sep = "")        # add arbitrary day on to end to make compatible w/ ggplot2

# aggregate by month
onset_counts <- aggregate(epicurve$onset, by = list(date = epicurve$onset), length) # aggregate by month
onset_counts$date = as.Date(onset_counts$date, format = "%Y/%m/%d") # covert to Date class

# plot
library(ggplot2)
library(scales)
ggplot(onset_counts, aes(x=date, y=x)) + geom_bar(stat="identity") + theme_bw() + theme(axis.text.x = element_text(angle=90, hjust = 1, vjust = 1)) +
    ylab("Frequency") + xlab(NULL) + scale_x_date(breaks="month", labels=date_format("%Y-%m"))
Nate Pope
  • 1,696
  • 12
  • 11
  • That has the right look, but doesn't include the months with a zero count - which is important for an epidemic curve. – Trent Sep 21 '13 at 11:46
  • @Trent I see. I have altered the code in my response to show months with no counts. – Nate Pope Sep 21 '13 at 17:45
1

I've also just happened across another way of making it look pretty, although it feels like a bit of a kludge.

#read data
epicurve <- read.csv("epicurve.csv", sep=",", header=T)
epicurve$onset <- as.Date(epicurve$onset, format="%d/%m/%Y")

#load libraries
library(ggplot2)
library(scales)

#plot
ggplot(epicurve, aes(onset)) + geom_histogram(colour="white", binwidth=30.4375) +
 scale_x_date(breaks=date_breaks("1 year"), minor_breaks=("1 month"), labels=date_format("%b-%Y")) + 
 scale_y_continuous(breaks=0:10, minor_breaks=NULL) +
 theme(axis.text.x = element_text(angle=45, vjust=0.5))
# binwidth = (365.25/12) = 30.4375 - which nicely makes the bins fit the scale nicely

Which gives this (notice the beautiful alignment of the bins!): fixed epicurve

Many thanks to Nate for the help, and hopefully this will be useful!

Trent
  • 55
  • 1
  • 8