The Ask:

Please help me understand my conceptual error in the use of scale_x_binned() in ggplot2 as it relates to centering breaks beneath the appropriate bin in a geom_histogram().

Starting Example:

library(ggplot2)

df <- data.frame(hour = sample(seq(0,23), 150, replace = TRUE))

# The data is just the integer values of the 24-hour clock in a day.  It is 
#   **NOT** continuous data.

ggplot(df, aes(x = hour)) +
  geom_histogram(bins = 24, fill = "grey60", color = "red")

This produces a histogram with each label properly centered beneath the bin to which it belongs, but I want to label every hour, 0 - 23.

To do that, I thought I would assign the breaks using scale_x_binned():

ggplot(df, aes(x = hour)) +
  geom_histogram(bins = 24, fill = "grey60", color = "red") +
  scale_x_binned(name = "Hour of Day",
               breaks = seq(0,23))
#> Warning: Removed 1 rows containing missing values (`geom_bar()`).

This returns the number of labels I wanted, but they are not centered beneath the bins as desired. I also get the warning message for missing values associated with geom_bar().

I believe I am overriding the bins = 24 from the geom_histogram() call when I add the scale_x_binned() call afterward, but I don't understand exactly what centers the bins in the first case, or how my new call wrecks it. I'd really like to have that clarified, as I am not seeing my error when I read the associated help pages.

EDIT:

The "Starting Example" essentially works (bins are centered) except for the number of labels I ultimately want. If the ggplot2 layer were built differently, what would the equivalent code be? That is, instead of:

ggplot(df, aes(x = hour)) +
  geom_histogram(bins = 24, fill = "grey60", color = "red")

the call was instead built something like:

ggplot(df, aes(x = hour)) +
  geom_histogram(fill = "grey60", color = "red") +
  scale_x_binned(n.breaks = 24)  # I know this isn't right, but akin to this.

or maybe

ggplot(df, aes(x = hour)) +
   stat_bin(bins = 24, center = 0, fill = "grey60", color = "red")
ScottyJ
  • So why are you not adding 0.5 to the breaks values? – IRTFM Nov 21 '22 at 23:25
  • @JonSpring I am literally using integers from 0-23. It's a histogram of 0-23. I'm not sure I follow that it matters for actual time. – ScottyJ Nov 21 '22 at 23:25
  • The default is that the breaks are the labeled boundaries of the bins. In decimal time, I presume you want the first bin to run from 0 (midnight) to 0.99 (12:59am), centered at 0.5 (12:30am), with a label 0? The simplest approach might be to add a `theme(axis.text.x = element_text(hjust = -0.5))`, but the text alignment there is tied to the left edge of the bin and not the midpoint, so it won't be perfect. – Jon Spring Nov 21 '22 at 23:31
  • @IRTFM is the adding of 0.5 what is happening to center the bins in the original case ("Starting Example") in my OP? I am really trying to understand what the equivalent call would be for that first case, and then extend it so all hours are displayed. I'll play with it. – ScottyJ Nov 22 '22 at 00:55

2 Answers

1

It sounds like you are looking for non-default labeling: you want the labels aligned to the midpoints of the bins rather than to their boundaries, which is what the breaks define.

scale_x_binned does not have minor breaks. It only has breaks at the boundaries of the bins, so it's not obvious to me how you could place the break labels at the midpoints of the bins.

We could instead use a continuous scale with a break at every integer, hiding the axis ticks and the major grid lines but keeping the minor grid lines, like below.

ggplot(df, aes(x = hour)) +
  geom_histogram(bins = 24, fill = "grey60", color = "red") +
  scale_x_continuous(name = "Hour of Day", breaks = 0:23) +
  theme(axis.ticks = element_blank(),
        panel.grid.major.x = element_blank())

Jon Spring
  • Actually, I am mostly trying to replicate / understand the default labeling. The first example in my question **DOES** have the bins centered over the labels by default, though it doesn't have one for each hour of the day as I'm ultimately looking to do (again, they're integers that represent hours). I'm trying to understand what the equivalent settings are under that `geom_histogram()` call that put the major breaks at `c(0,5,10,15,20)` and centers them. Is `geom_histogram()` setting a `scale_x_continuous()` as in your answer? (cont'd) – ScottyJ Nov 22 '22 at 00:50
  • (cont'd) That doesn't seem like what the default is doing, and I was thinking `scale_x_binned()` or maybe `scale_x_discrete()` was the appropriate function. – ScottyJ Nov 22 '22 at 00:51
1

I thought the same as you, namely scale_x_discrete, but the data given to geom_histogram is assumed to be continuous, so ...

# Save the plot as plt so it can be inspected with ggplot_build() later
plt <- ggplot(df, aes(x = hour)) +
   geom_histogram(bins = 24, fill = "grey60", color = "red") +
   scale_x_continuous(breaks = 0:23)
plt

(Doesn't require any machinations with theme.)
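
If you would rather treat the hours as categories than as a continuous variable, one alternative (a sketch, not what geom_histogram() does internally) is to convert hour to a factor and use geom_bar(), which draws each bar directly over its own label. The data frame here is a deterministic stand-in for the question's random sample, and drop = FALSE is included so that an hour with no observations would still keep its slot on the axis.

```r
library(ggplot2)

# Stand-in data (assumption): each hour 0..23 appears three times
df <- data.frame(hour = rep(0:23, times = 3))

# Treat hour as discrete: one bar per factor level, label under each bar
p_bar <- ggplot(df, aes(x = factor(hour, levels = 0:23))) +
  geom_bar(fill = "grey60", color = "red") +
  scale_x_discrete(name = "Hour of Day", drop = FALSE)

# Counts computed by the layer, one row per hour present in the data
bar_data <- layer_data(p_bar, 1)
```

The trade-off is that the x axis is no longer numeric, so anything that relies on a continuous scale (for example numeric limits) would need to be handled differently.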

I wish I could tell you that I found out how geom_histogram is centering the labels, but ggproto objects exist in a cavern with too many tunnels and passages for my mind to follow.

So I took a shot at examining the plot object that I created when I produced the png graphic above:

ggplot_build(plt)
# ------------
$data
$data[[1]]
    y count  x xmin xmax    density ncount ndensity flipped_aes PANEL group ymin ymax colour   fill size linetype
1   6     6  0 -0.5  0.5 0.04000000    0.6      0.6       FALSE     1    -1    0    6    red grey60  0.5        1
2   7     7  1  0.5  1.5 0.04666667    0.7      0.7       FALSE     1    -1    0    7    red grey60  0.5        1
3   4     4  2  1.5  2.5 0.02666667    0.4      0.4       FALSE     1    -1    0    4    red grey60  0.5        1
4   5     5  3  2.5  3.5 0.03333333    0.5      0.5       FALSE     1    -1    0    5    red grey60  0.5        1
5   7     7  4  3.5  4.5 0.04666667    0.7      0.7       FALSE     1    -1    0    7    red grey60  0.5        1
#snipped remainder

So the reason the tick marks are centered is that the bins themselves are built with width 1 and centered on the integers (xmin = x - 0.5, xmax = x + 0.5), so integer breaks fall at the bin midpoints.
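
As a rough check of that reading, the bin placement can be spelled out explicitly instead of being left to bins = 24. The sketch below assumes the data spans 0 to 23, so that width-1 bins centered on 0 match the table above; binwidth and center are arguments of geom_histogram(), while the data frame and the names p and bins are stand-ins introduced here.

```r
library(ggplot2)

# Stand-in data (assumption): each hour 0..23 appears three times
df <- data.frame(hour = rep(0:23, times = 3))

# Explicit bin placement: width-1 bins centered on the integers
p <- ggplot(df, aes(x = hour)) +
  geom_histogram(binwidth = 1, center = 0, fill = "grey60", color = "red") +
  scale_x_continuous(name = "Hour of Day", breaks = 0:23)

# layer_data() returns the computed bins for the first layer
bins <- layer_data(p, 1)
head(bins[, c("x", "xmin", "xmax", "count")])
```

If the xmin and xmax columns come out as x - 0.5 and x + 0.5, the bins are centered on the integer breaks, which is why the labels line up.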

Further exploration of what's in the ggplot_build() results:

ls(envir=ggplot_build(plt)$layout)
#[1] "coord"          "coord_params"   "facet"          "facet_params"   "layout"         "panel_params"  
#[7] "panel_scales_x" "panel_scales_y" "super"  

ggplot_build(plt)$layout$panel_params
#-------results
[[1]]
[[1]]$x
<ggproto object: Class ViewScale, gg>
    aesthetics: x xmin xmax xend xintercept xmin_final xmax_final xlower ...
    break_positions: function
    break_positions_minor: function
    breaks: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21  ...
    continuous_range: -1.7 24.7
    dimension: function
    get_breaks: function
    get_breaks_minor: function
#---- snipped remaining output
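
To pull just the break positions and axis range out of the built plot, rather than reading the printed object, something like the following should work. It assumes a ggplot2 version in which the built plot can be indexed with $ as in the output above; the data frame is a stand-in for the question's sample, and x_view, x_breaks and x_range are names introduced here for illustration.

```r
library(ggplot2)

# Stand-in data (assumption): each hour 0..23 appears three times
df <- data.frame(hour = rep(0:23, times = 3))

plt <- ggplot(df, aes(x = hour)) +
  geom_histogram(bins = 24, fill = "grey60", color = "red") +
  scale_x_continuous(breaks = 0:23)

# Drill into the x view scale of the first (only) panel
built  <- ggplot_build(plt)
x_view <- built$layout$panel_params[[1]]$x

x_breaks <- x_view$breaks            # break positions on the x axis
x_range  <- x_view$continuous_range  # axis limits after expansion
```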
IRTFM
  • You're right, this works, and I now see the help page for `geom_histogram()` **DOES** say that this is for _continuous_ variables. Also, in the 'Details' section, it says you can replicate histogram functionality using the combination of `scale_x_binned()` with `geom_bar()` which I'm toying with now (but not getting a good result). I still don't feel I fully grasp it though... – ScottyJ Nov 22 '22 at 03:03
  • Oh, and this edit you just posted is **GOLD**. I didn't know about `ggplot_build()`. Thank you for this! – ScottyJ Nov 22 '22 at 03:07
  • I found a bit more with `ggplot_build`. See next edit. – IRTFM Nov 22 '22 at 03:12