32

I'm looking for an easier way to draw the cumulative distribution line in ggplot.

I have some data whose histogram I can immediately display with

qplot(mydata, binwidth = 1)

I found a way to do it at http://www.r-tutor.com/elementary-statistics/quantitative-data/cumulative-frequency-graph, but it involves several steps, which is time-consuming when exploring data.

Is there a more straightforward way to do it in ggplot, similar to how trend lines and confidence intervals can be added by specifying options?

Michael Currie
wishihadabettername

3 Answers

62

The new version of ggplot2 (0.9.2.1) has a built-in stat_ecdf() function, which lets you plot cumulative distributions very easily.

qplot(rnorm(1000), stat = "ecdf", geom = "step")

Or

df <- data.frame(x = c(rnorm(100, 0, 3), rnorm(100, 0, 10)),
             g = gl(2, 100))
ggplot(df, aes(x, colour = g)) + stat_ecdf()

Code samples from ggplot2 documentation.
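Since qplot() is deprecated in recent ggplot2 releases, the same plot can also be sketched with the full ggplot() interface. This is a minimal sketch, not taken from the original answer: `mydata` is a hypothetical stand-in for the asker's numeric vector, and the seed is only there to make the example reproducible.

```r
library(ggplot2)

set.seed(1)                 # hypothetical seed, for reproducibility only
mydata <- rnorm(1000)       # stand-in for the asker's numeric vector

# stat_ecdf() computes the empirical CDF; geom = "step" draws it as a staircase
p <- ggplot(data.frame(x = mydata), aes(x)) +
  stat_ecdf(geom = "step")
p
```

Note that the y axis here shows cumulative proportions (0 to 1), not cumulative counts.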

Chris
29

There is a built-in ecdf() function in R, which should make things easier. Here's some sample code using plyr:

library(plyr)
library(ggplot2)
data(iris)

## ECDF over all species.
## Compute ecdf before redefining Sepal.Length, so it is built from the
## full data even if summarize() evaluates its arguments in order.
iris.all <- summarize(iris,
                      ecdf = ecdf(Sepal.Length)(unique(Sepal.Length)),
                      Sepal.Length = unique(Sepal.Length))

ggplot(iris.all, aes(Sepal.Length, ecdf)) + geom_step()

## ECDF within species
iris.species <- ddply(iris, .(Species), summarize,
                      ecdf = ecdf(Sepal.Length)(unique(Sepal.Length)),
                      Sepal.Length = unique(Sepal.Length))

ggplot(iris.species, aes(Sepal.Length, ecdf, color = Species)) + geom_step()

Edit: I just realized that you want cumulative frequency. You can get that by multiplying the ecdf value by the total number of observations:

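## Multiply the ECDF by the number of observations to get cumulative counts.
## ecdf is computed before Sepal.Length is redefined, so both ecdf() and
## length() see the full data.
iris.all <- summarize(iris,
                      ecdf = ecdf(Sepal.Length)(unique(Sepal.Length)) * length(Sepal.Length),
                      Sepal.Length = unique(Sepal.Length))

iris.species <- ddply(iris, .(Species), summarize,
                      ecdf = ecdf(Sepal.Length)(unique(Sepal.Length)) * length(Sepal.Length),
                      Sepal.Length = unique(Sepal.Length))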
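As a quick sanity check on the scaling step, the ECDF multiplied by the number of observations should match the running total of a frequency table. The sketch below uses only base R; the vector `x` is a small hypothetical sample, not data from the question.

```r
x <- c(1, 2, 2, 3, 5)            # hypothetical sample
vals <- sort(unique(x))          # distinct values, in increasing order

# Cumulative frequency via the ECDF, scaled by the number of observations
cum_freq_ecdf <- ecdf(x)(vals) * length(x)

# Cumulative frequency via a frequency table
cum_freq_table <- as.numeric(cumsum(table(x)))

all.equal(cum_freq_ecdf, cum_freq_table)
```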
JoFrhwld
  • 1
    This is a great answer, but there's one thing I can't quite figure out. In the `ecdf(Sepal.Length)(unique(Sepal.Length))` bit, what's happening? I understand that it's extracting concrete values from the `ecdf` object, but I don't remember ever seeing that (x)(y) notation before... can you help me understand that? Thanks! – Matt Parker Aug 30 '11 at 15:34
  • 4
    @MattParker `ecdf()` returns a function so that notation is evaluating the returned function at the unique values of `Sepal.Length`. – Gavin Simpson Nov 08 '11 at 16:16
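To illustrate the point in the comment above, here is a small base-R sketch of the `(x)(y)` pattern. The sample vector and the name `Fn` are hypothetical, chosen only for the example.

```r
x <- c(1, 2, 2, 3, 5)   # hypothetical sample

Fn <- ecdf(x)           # ecdf() returns a function
is.function(Fn)

Fn(2)                   # proportion of observations <= 2
ecdf(x)(2)              # same call, without naming the intermediate function
```

Writing `ecdf(x)(2)` simply calls the returned function immediately instead of storing it in a variable first.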
21

Even easier:

qplot(unique(mydata), ecdf(mydata)(unique(mydata))*length(mydata), geom='step')
Yang