
I am trying to visualise the difference between two histograms of distribution functions, such as the difference between the following two curves:

enter image description here

When the difference is large, you can simply plot the two curves on top of each other and fill the area between them, as shown above. When the difference becomes very small, however, this is hard to read. Another option is to plot the difference itself, as follows:

enter image description here

However, this seems very hard to read for anyone seeing such a graph for the first time, so I was wondering: is there any other way to visualise the difference between two distribution functions?

ruben baetens
  • I think this is an interesting question, but it's too open-ended and opinion-based for SO. (And it's not really about programming, either.) Maybe it would be on-topic at Cross Validated? – Gregor Thomas Mar 31 '15 at 21:53
  • Just to make sure we're talking about the same things: You want to visualize probability density functions by considering the histograms of a realisation of said probability distributions, right? Because cumulative distribution functions are something quite different... – Eike P. Mar 31 '15 at 22:34
  • Example data sets would be nice. – Eike P. Mar 31 '15 at 22:35
  • @jhin Is there a way to put example data on SO? – ruben baetens Apr 01 '15 at 06:12
  • For small data sets you can always use `dput`, but for really large data sets I'm not aware of anything special. Maybe you could put it on [gist](https://gist.github.com/) (also using `dput`)? – Eike P. Apr 01 '15 at 11:08
  • I just realised that the second picture actually says "CDF differences". This is inconsistent with the first picture, which clearly does not show a CDF... – Eike P. Apr 01 '15 at 11:29
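
The `dput` suggestion from the comments can be sketched as follows. This is only an illustration: the vector `x_small` is made-up example data, and the round trip through `deparse()` and `parse()` merely simulates a reader pasting the printed expression back into R.

```r
## Sketch: sharing a small data set as text with dput().
set.seed(1)
x_small <- round(rlnorm(10, 0, 1), 3)   # hypothetical example data

## dput() prints an R expression that recreates the object;
## this printed text is what would be pasted into a question.
dput(x_small)

## Simulate the reader's side: turn the object into text and
## evaluate that text again.
txt    <- paste(deparse(x_small), collapse = " ")
x_back <- eval(parse(text = txt))
```

For data frames the same call works (`dput(head(df, 20))`), although the output gets long quickly.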

1 Answer


One option might be to simply combine your two proposals, scaling up the differences to make them visible.

What follows is an attempt to do this with ggplot2. It turned out to be quite a bit more involved than I initially thought, and I'm not a hundred percent satisfied with the result, but maybe it helps nevertheless. Comments and improvements are very welcome.

library(ggplot2)
library(dplyr)

## function that replicates default ggplot2 colors
## taken from [1]
gg_color_hue <- function(n) {
  hues = seq(15, 375, length=n+1)
  hcl(h=hues, l=65, c=100)[1:n]
}

## Set up sample data
set.seed(1)
n <- 2000
x1 <- rlnorm(n, 0, 1)
x2 <- rlnorm(n, 0, 1.1)
df <- bind_rows(data.frame(sample=1, x=x1), data.frame(sample=2, x=x2)) %>%
  mutate(sample = as.factor(sample))

## Calculate density estimates
g1 <- ggplot(df, aes(x=x, group=sample, colour=sample)) +
  geom_density(data = df) + xlim(0, 10)
gg1 <- ggplot_build(g1)

## Use these estimates (available at the same x coordinates!) for
## calculating the differences.
## Inspired by [2]
x <- gg1$data[[1]]$x[gg1$data[[1]]$group == 1]
y1 <- gg1$data[[1]]$y[gg1$data[[1]]$group == 1]
y2 <- gg1$data[[1]]$y[gg1$data[[1]]$group == 2]
df2 <- data.frame(x = x, ymin = pmin(y1, y2), ymax = pmax(y1, y2), 
                  side=(y1<y2), ydiff = y2-y1)
g2 <- ggplot(df2) +
  ## alpha is a fixed value, so it is set outside aes()
  geom_ribbon(aes(x = x, ymin = ymin, ymax = ymax, fill = side), alpha = 0.5) +
  geom_line(aes(x = x, y = 5 * abs(ydiff), colour = side)) +
  geom_area(aes(x = x, y = 5 * abs(ydiff), fill = side), alpha = 0.4)
g3 <- g2 +
  ## size = 1 sets the line width (use linewidth = 1 in ggplot2 >= 3.4.0)
  geom_density(data = df, size = 1, aes(x = x, group = sample, colour = sample)) +
  xlim(0, 10) +
  guides(colour = "none") +
  ylab("Curves: density\n Shaded area: 5 * difference of densities") +
  scale_fill_manual(name = "samples", labels = 1:2, values = gg_color_hue(2)) +
  scale_colour_manual(limits = list(1, 2, FALSE, TRUE), values = rep(gg_color_hue(2), 2))

print(g3)

enter image description here

Sources: SO answer 1, SO answer 2
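
The step that extracts the density estimates via `ggplot_build` can also be sketched with base R's `density()`. This is an alternative under stated assumptions, not what the code above does: the grid limits (0 to 10) and `n = 512` are illustrative choices, and unlike `xlim(0, 10)` in ggplot2, `density()` estimates from the full sample and only evaluates on that range, so the numbers will differ slightly.

```r
## Sketch: density estimates on a shared grid with base R.
set.seed(1)
n  <- 2000
x1 <- rlnorm(n, 0, 1)
x2 <- rlnorm(n, 0, 1.1)

## Using the same from/to/n in both calls makes d1$x and d2$x
## identical, so the y values can be subtracted directly.
d1 <- density(x1, from = 0, to = 10, n = 512)
d2 <- density(x2, from = 0, to = 10, n = 512)

## Same columns as df2 in the ggplot2-based code.
df2 <- data.frame(x = d1$x,
                  ymin = pmin(d1$y, d2$y),
                  ymax = pmax(d1$y, d2$y),
                  side = d1$y < d2$y,
                  ydiff = d2$y - d1$y)
```

A data frame built this way can be passed to the same `geom_ribbon`, `geom_line` and `geom_area` layers.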


As suggested by @Gregor in the comments, here's a version that draws two separate plots below each other, sharing the same x-axis scaling. The legends, at least, should obviously still be tweaked.

library(ggplot2)
library(dplyr)
library(grid)

## function that replicates default ggplot2 colors
## taken from [1]
gg_color_hue <- function(n) {
  hues = seq(15, 375, length=n+1)
  hcl(h=hues, l=65, c=100)[1:n]
}

## Set up sample data
set.seed(1)
n <- 2000
x1 <- rlnorm(n, 0, 1)
x2 <- rlnorm(n, 0, 1.1)
df <- bind_rows(data.frame(sample=1, x=x1), data.frame(sample=2, x=x2)) %>%
  mutate(sample = as.factor(sample))

## Calculate density estimates
g1 <- ggplot(df, aes(x=x, group=sample, colour=sample)) +
  geom_density(data = df) + xlim(0, 10)
gg1 <- ggplot_build(g1)

## Use these estimates (available at the same x coordinates!) for
## calculating the differences.
## Inspired by [2]
x <- gg1$data[[1]]$x[gg1$data[[1]]$group == 1]
y1 <- gg1$data[[1]]$y[gg1$data[[1]]$group == 1]
y2 <- gg1$data[[1]]$y[gg1$data[[1]]$group == 2]
df2 <- data.frame(x = x, ymin = pmin(y1, y2), ymax = pmax(y1, y2), 
                  side=(y1<y2), ydiff = y2-y1)
g2 <- ggplot(df2) +
  ## alpha is a fixed value, so it is set outside aes()
  geom_ribbon(aes(x = x, ymin = ymin, ymax = ymax, fill = side), alpha = 0.5) +
  geom_density(data = df, size = 1, aes(x = x, group = sample, colour = sample)) +
  xlim(0, 10) +
  guides(fill = "none")
g3 <- ggplot(df2) +
  geom_line(aes(x = x, y = abs(ydiff), colour = side)) +
  geom_area(aes(x = x, y = abs(ydiff), fill = side), alpha = 0.4) +
  guides(fill = "none")
## See [3]
grid.draw(rbind(ggplotGrob(g2), ggplotGrob(g3), size="last"))

enter image description here

... or with `abs(ydiff)` replaced by `ydiff` in the construction of the second plot:

enter image description here

Source: SO answer 3
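
The signed variant (plotting `ydiff` rather than `abs(ydiff)`) can be sketched on its own as follows. To keep the example short, the densities are computed with base R's `density()` on a shared grid instead of `ggplot_build`; that swap, the grid settings and the names `df_signed` and `g_signed` are assumptions made for this illustration only.

```r
library(ggplot2)

## Sample data as in the answer.
set.seed(1)
x1 <- rlnorm(2000, 0, 1)
x2 <- rlnorm(2000, 0, 1.1)

## Density estimates on a common grid (illustrative settings).
d1 <- density(x1, from = 0, to = 10, n = 512)
d2 <- density(x2, from = 0, to = 10, n = 512)
df_signed <- data.frame(x = d1$x, ydiff = d2$y - d1$y)

## Signed difference: positive where sample 2 is denser,
## negative where sample 1 is denser.
g_signed <- ggplot(df_signed) +
  geom_hline(yintercept = 0, colour = "grey50") +
  geom_ribbon(aes(x = x, ymin = pmin(ydiff, 0), ymax = pmax(ydiff, 0)),
              alpha = 0.4) +
  geom_line(aes(x = x, y = ydiff)) +
  ylab("Density of sample 2 - density of sample 1")
print(g_signed)
```

The zero line gives readers a reference, which may make the sign of the difference easier to read than the colour coding alone.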

Eike P.
  • Two plots in a single column might be preferable since the y-scales are different. – Gregor Thomas Apr 01 '15 at 01:07
  • Yeah! Now you don't have to mess with the scale of the difference. – Gregor Thomas Apr 01 '15 at 02:04
  • You're probably right, this seems to be a cleaner solution! – Eike P. Apr 01 '15 at 02:06
  • Having both graphs below each other indeed shows much more information, thank you for this suggestion. If I were to plot FALSE / TRUE with different signs, e.g. FALSE as negative, it might be even better, as it would allow comparing more than two CDFs ... – ruben baetens Apr 01 '15 at 06:15
  • (Sorry if I'm being stubborn, but again, these are not CDFs but rather PDFs!) – Eike P. Apr 01 '15 at 11:13
  • How exactly would you like to compare more than 2 PDFs? Plot pairwise differences? So for 3 PDFs this would be 3 diffs, for 4 PDFs 6 diffs, etc.? Or do you have something else in mind? – Eike P. Apr 01 '15 at 11:14
  • @jhin I meant PDFs of course, typo ... For my specific case I have n-1 diffs for n PDFs, as I have one 'reference' case to which I compare. – ruben baetens Apr 02 '15 at 12:08
  • @jhin I updated the title to PDF, my apologies for the confusion. – ruben baetens Apr 02 '15 at 12:10
  • @rubenbaetens thanks ;) And I see, comparing to a reference of course makes sense! – Eike P. Apr 02 '15 at 14:14
  • @rubenbaetens does this answer your question? If yes, you should accept it. If not, please explain why it doesn't so that somebody else can help you. – Eike P. Apr 06 '15 at 11:06
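
The reference-case comparison raised in the comments (n-1 differences for n PDFs) could be sketched roughly as below. Everything here is an assumption for illustration: the three samples, their names (`reference`, `case_a`, `case_b`), the grid settings, and the use of base R's `density()` rather than `ggplot_build`.

```r
library(ggplot2)

## Hypothetical samples: one reference case and two others.
set.seed(1)
samples <- list(reference = rlnorm(2000, 0, 1),
                case_a    = rlnorm(2000, 0, 1.1),
                case_b    = rlnorm(2000, 0.1, 1))

## Density estimates on one shared grid (illustrative settings).
dens <- lapply(samples, density, from = 0, to = 10, n = 512)

## Signed difference of every non-reference case to the reference.
ref_y <- dens$reference$y
cases <- setdiff(names(dens), "reference")
diffs <- do.call(rbind, lapply(cases, function(nm) {
  data.frame(case = nm, x = dens[[nm]]$x, ydiff = dens[[nm]]$y - ref_y)
}))

## One line per case, with a zero line as the reference.
g_ref <- ggplot(diffs, aes(x = x, y = ydiff, colour = case)) +
  geom_hline(yintercept = 0, colour = "grey50") +
  geom_line() +
  ylab("Density minus reference density")
print(g_ref)
```

With signed differences the curves share a single panel, so the number of plots stays fixed as the number of cases grows.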