
I'm trying to plot a stacked histogram using ggplot2:

set.seed(1)
my.df <- data.frame(param = runif(10000,0,1), 
                    x = runif(10000,0.5,1))
my.df$param.range <- cut(my.df$param, breaks = 5)

library(ggplot2)

Without logging the y-axis:

ggplot(my.df,aes_string(x = "x", fill = "param.range")) + 
    geom_histogram(binwidth = 0.1, pad = TRUE) + 
    scale_fill_grey()

gives: [figure: stacked histogram with a linear y-axis]

But I want to log10+1 transform the y-axis to make it easier to read:

ggplot(my.df, aes_string(x = "x", y = "..count..+1", fill = "param.range")) + 
    geom_histogram(binwidth = 0.1, pad = TRUE) + 
    scale_fill_grey() + 
    scale_y_log10()

which gives: [figure: the same stacked histogram with a log10 y-axis]

The tick marks on the y-axis don't make sense.

I get the same behavior if I log10 transform rather than log10+1:

ggplot(my.df, aes_string(x = "x", fill = "param.range")) + 
    geom_histogram(binwidth = 0.1, pad = TRUE) + 
    scale_fill_grey() + 
    scale_y_log10()

Any idea what is going on?

alistaire
dan
  • 1. Why don't the tick marks make sense to you? 2. I don't see any transformation in the last line of code. – Hack-R Nov 14 '16 at 02:29
  • Sorry about the last line of code - corrected. About the y-axis tick values, I thought it was supposed to show the log10 of the counts shown in the first figure, so they should be 2.69, 3, 3.17, 3.30 rather than 1,000, 10,000,000, 100,000,000,000 – dan Nov 14 '16 at 02:35
  • The y-axis will still be denominated in the actual counts, not the logs of those counts, but the y-scale is transformed so that the physical distance for each factor of 10 is the same. – eipi10 Nov 14 '16 at 02:57
  • The mystery is why you're getting counts of 10^13 for each histogram bar when you have only 10,000 total data points. If you add `position="identity"` or `position="dodge"` to `geom_histogram` or if you add `+ facet_wrap(~ param.range)` or if you get rid of the fill aesthetic, then the counts are correct. But for some reason, the default stacked histograms give nonsensical counts with `scale_y_log10`. – eipi10 Nov 14 '16 at 02:58
  • Yes, eipi10, that's exactly what's going on – dan Nov 14 '16 at 03:02

1 Answer

It looks like invoking scale_y_log10 with a stacked histogram causes ggplot to plot the product of the counts for each component of the stack within each x bin. Below is a demonstration. We create a data frame called product.of.counts that contains, for each x bin, the product of the counts across the param.range bins. We use geom_text to add those values to the plot and see that they coincide with the top of each stack of histogram bars.

At first I thought this was a bug, but after a bit of searching, I was reminded of the way ggplot does the log transformation. As described in the linked answer, "scale_y_log10 makes the counts, converts them to logs, stacks those logs, and then displays the scale in the anti-log form. Stacking logs, however, is not a linear transformation, so what you have asked it to do does not make any sense."

As a simpler example, say each of five components of a stacked bar has a count of 100. Then log10(100) = 2 for all five, and the sum of the logs is 10. ggplot then takes the anti-log for the scale, which gives 10^10 for the total height of the bar (which is 100^5), even though the actual height is 100 x 5 = 500. This is exactly what's happening with your plot.
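
The arithmetic in that example can be checked directly in base R. This is only a sketch of the calculation, not of ggplot's internals; the variable names are made up for illustration.

```r
# Five stack components, each with a count of 100 (the example above)
counts <- rep(100, 5)

# Height of the bar if the raw counts are stacked
true.height <- sum(counts)

# Height on the log10 scale if the logs are stacked instead
stacked.logs <- sum(log10(counts))

# Value the axis labels report after taking the anti-log
displayed.height <- 10^stacked.logs

# Summing logs is the same as taking the log of the product
all.equal(displayed.height, prod(counts))
```

Here true.height is the sum of the counts, while displayed.height equals their product, which is why the labelled bar tops are so much larger than the number of data points.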

library(dplyr)
library(ggplot2)

# Data
set.seed(1)
my.df <- data.frame(param=runif(10000,0,1),x=runif(10000,0.5,1))
my.df$param.range <- cut(my.df$param,breaks=5)

# Calculate product of counts within each x bin.
# The cut() breaks are chosen to match the 0.1-wide histogram bins.
product.of.counts <- my.df %>% 
  group_by(param.range, breaks=cut(x, breaks=seq(-0.05, 1.05, 0.1), labels=seq(0,1,0.1))) %>%
  tally() %>%
  group_by(breaks) %>% 
  summarise(prod = prod(n),
            param.range=NA) %>%   # placeholder for the inherited fill aesthetic
  ungroup() %>%
  mutate(breaks = as.numeric(as.character(breaks)))

ggplot(my.df, aes(x, fill=param.range)) + 
  geom_histogram(binwidth = 0.1, colour="grey30") + 
  scale_fill_grey() + 
  scale_y_log10(breaks=10^(0:14)) +
  geom_text(data=product.of.counts, size=3.5, 
            aes(x=breaks, y=prod, label=format(prod, scientific=TRUE, digits=3)))

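
One way to sidestep the problem, noted in the comments under the question, is to avoid stacking altogether, for example with position = "dodge", so that each bar is log-transformed on its own. A minimal sketch, assuming the same simulated data as above; the object names p and bars are only for illustration, and layer_data() is used here just to inspect the computed counts.

```r
library(ggplot2)

# Same simulated data as in the question
set.seed(1)
my.df <- data.frame(param = runif(10000, 0, 1), x = runif(10000, 0.5, 1))
my.df$param.range <- cut(my.df$param, breaks = 5)

# Dodged bars are drawn side by side, so no logs are added together
p <- ggplot(my.df, aes(x, fill = param.range)) +
  geom_histogram(binwidth = 0.1, position = "dodge") +
  scale_fill_grey() +
  scale_y_log10()

# Inspect the bars ggplot computed; `count` holds the raw per-bar counts
bars <- layer_data(p, 1)
sum(bars$count)
```

The per-bar counts should add up to the number of rows in my.df, unlike the stacked version where the bar tops are labelled with products of counts.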

Community
eipi10
  • While the above answer does an admirable job of explaining why the desired behavior is not achieved, if you want suggestions for getting the desired behavior, go to the answer @eipi10 links: https://stackoverflow.com/a/9507037/496488 – flies Aug 29 '19 at 13:59