I can produce a cumulative histogram as shown below. Using mtcars as an example, each car has a certain number of carburetors. For each number of carburetors (x axis), the plot shows the number of cars with that many carburetors or fewer. I would like to produce a plot that instead shows the number of cars with more than that many carburetors.

This link provides further explanation

Thanks!

library(tidyverse)

mtcars %>%
  ggplot(aes(carb)) +
  # after_stat(count) replaces the deprecated ..count.. notation
  geom_histogram(aes(y = cumsum(after_stat(count))))

Displayed plot

Dan Adams
Mark

2 Answers

stat_ecdf() is a good starting point for this visualization, but it needs a few modifications.

  1. In an empirical CDF, y is the cumulative proportion of values less than or equal to a given x. Since you want the proportion of values greater than x, we can invert the output. To do this we use the variables computed by the stat inside ggplot2. These used to be accessed through the .. or stat() notation (e.g. ..y.. or stat(y)); the preferred notation is now after_stat() (also described in this and this blog post). The final code therefore specifies the inversion inside the aes() of stat_ecdf() by setting y = 1 - after_stat(y), meaning "once the stat has calculated y, subtract it from 1 before plotting".
  2. You want to see actual counts rather than proportions. One easy option is a second axis whose transformation simply multiplies by the number of observations. I calculate that number (n) outside the ggplot() call because it is cumbersome to access from within ggplot.
  3. Inverting the ECDF gives the count of observations strictly greater than x. Since you are also asking for the count of observations greater than or equal to x, we need to shift the default output of stat_ecdf(). Here I do this by simply specifying aes(carb + 1), which gives the right value at integer x because carb takes only integer values. I show both versions below for comparison.
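The inversion and the shift in the list above can be checked numerically without ggplot. The following is a minimal sketch using base R's ecdf(); the names Fn, x and check are only illustrative, and it assumes carb takes integer values, as it does in mtcars.

```r
# Empirical CDF of carb: Fn(x) is the proportion of cars with carb <= x
carb <- mtcars$carb
n <- length(carb)
Fn <- ecdf(carb)

x <- 0:9

check <- data.frame(
  x = x,
  # 1 - ECDF, scaled by n: cars with strictly more than x carburetors
  greater = round(n * (1 - Fn(x))),
  # same curve shifted right by 1: cars with at least x carburetors
  greater_or_equal = round(n * (1 - Fn(x - 1))),
  # direct counts for comparison
  direct_greater = sapply(x, function(v) sum(carb > v)),
  direct_greater_or_equal = sapply(x, function(v) sum(carb >= v))
)

print(check)
```

If the reasoning holds, the ECDF-based columns should match the direct counts row by row.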

Note: I'm showing the points with the line to help illustrate the actual y value at each x since the geom = "step" (the default geom of stat_ecdf()) somewhat obscures it.

library(tidyverse)

n <- nrow(mtcars)

mtcars %>% 
  ggplot(aes(carb)) +
  stat_ecdf(aes(y = (1 - after_stat(y))), geom = "point") +
  stat_ecdf(aes(y = (1 - after_stat(y))), geom = "step") +
  scale_y_continuous("Proportion", position = "right",
                     # secondary axis: proportion * n = count
                     sec.axis = sec_axis(~ .x * n, name = "Count")) +
  scale_x_continuous(limits = c(0, NA), breaks = 0:8) +
  ggtitle("y = count with carb > x")


mtcars %>% 
  ggplot(aes(carb + 1)) +
  stat_ecdf(aes(y = (1 - after_stat(y))), geom = "point") +
  stat_ecdf(aes(y = (1 - after_stat(y))), geom = "step") +
  scale_y_continuous("Proportion", position = "right",
                     # secondary axis: proportion * n = count
                     sec.axis = sec_axis(~ .x * n, name = "Count")) +
  scale_x_continuous(limits = c(0, NA), breaks = 0:9) +
  ggtitle("y = count with carb >= x")

Created on 2022-09-30 by the reprex package (v2.0.1)
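If the secondary axis feels indirect, a different approach (not the one used above) is to count directly and plot the result. This is only a rough sketch; the counts data frame, the n_greater column and the p_counts object are names made up for illustration.

```r
library(ggplot2)

# Hypothetical helper table: for each x, the number of cars with carb > x
counts <- data.frame(x = 0:8)
counts$n_greater <- vapply(
  counts$x,
  function(v) sum(mtcars$carb > v),
  integer(1)
)

# Plot the precomputed counts as a step line with points at each x
p_counts <- ggplot(counts, aes(x, n_greater)) +
  geom_step() +
  geom_point() +
  scale_x_continuous(breaks = 0:8) +
  labs(y = "Count", title = "y = count with carb > x")

p_counts
```

Swapping `>` for `>=` in the helper function should give the "greater than or equal to" version without shifting x.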

Dan Adams
  • Thank you! This is what I was after. Can you please suggest what to edit to produce a plot such that the `y` value at any given `x` represents the number of observations **greater than or equal to** that value of `x`? Sorry if this is a very simple modification - tbh I'm not really sure what `..y..` means here and can't find much online. Do you know where I can read about this? – Mark Sep 29 '22 at 11:39
  • If you want "greater than or equal to" you probably need to shift all the values, because `stat_ecdf` calculates the proportion that is less than or equal to each value, so its complement is strictly "greater than". The `..y..` is a kind of weird syntax. It refers to special variables that are internally computed by `ggplot` which you can still access and use. I don't know if there's formal documentation on it, but it's described in [this](https://stackoverflow.com/questions/14570293/special-variables-in-ggplot-count-density-etc) post. – Dan Adams Sep 29 '22 at 12:20
  • I found an important mistake in the original answer. I have updated with a better version and more explanation of the `..y..` business. – Dan Adams Sep 30 '22 at 15:40

Like this?

library(tidyverse)

mtcars |>
  select(disp) |>
  ggplot(aes(disp, y = 1 - after_stat(y))) +
  stat_ecdf()
Isaiah
  • Thanks for the suggestion! I think this is close but I'm trying to produce a plot where the y axis shows a cumulative count rather than a density. Also, ideally would like to show this using a histogram but if not possible then this would work as well! – Mark Sep 28 '22 at 09:14
  • FYI - no need to `select` just the variable you want to plot. That's what `aes()` is for. – Dan Adams Sep 29 '22 at 01:03
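
For the histogram-style version asked about in the comment above, one option is to reverse the cumulative sum of the bar counts. The sketch below is a variant under stated assumptions rather than part of either answer: it uses geom_bar() instead of geom_histogram() because carb is a small set of integers, it relies on the computed counts being ordered by x, and it assumes ggplot2 3.3.0 or later for after_stat(). The object name p_bar is illustrative.

```r
library(ggplot2)

p_bar <- ggplot(mtcars, aes(carb)) +
  # rev(cumsum(rev(count))) accumulates from the largest x downwards,
  # so each bar shows the number of cars with carb >= x
  geom_bar(aes(y = rev(cumsum(rev(after_stat(count)))))) +
  scale_x_continuous(breaks = 1:8) +
  labs(y = "Count", title = "y = count with carb >= x")

p_bar
```

For the strictly "greater than" version, `sum(after_stat(count)) - cumsum(after_stat(count))` should work in the same place. Bars appear only at carb values that occur in the data.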