I'd much appreciate anyone's help to resolve this question please. It seems like it should be so simple, but after many hours experimenting, I've had to stop in and ask for help. Thank you very much in advance!

Summary of question:

How can one ensure in ggplot2 the y-axis of a histogram is labelled using only integers (frequency count values) and not decimals?

The functions, arguments and datatype changes tried so far include:

  • geom_histogram(), geom_bar() and geom_col() - in each case, with or without the argument stat = "identity" where relevant.
  • adding + scale_y_discrete(), with or without + scale_x_discrete()
  • converting the underlying count data to a factor and/or the bin data to a factor

Ideally, the solution would use only base R or ggplot2, rather than additional external dependencies such as the pretty_breaks() function in the scales package.

Sample data:

sample <- data.frame(binMidPts = c(4500,5500,6500,7500), counts = c(8,0,9,3))

The x-axis consists of bins of a continuous variable, and the y-axis is intended to show the count of observations in those bins. For example, Bin 1 covers the x-axis range [4000 <= x < 5000], has a mid-point 4500, with 8 data points observed in that bin / range.
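For readers starting from raw observations rather than pre-computed counts, here is a minimal base R sketch of how a table like the sample data could be built. The raw values below are hypothetical and were chosen only so that they reproduce the counts above; right = FALSE makes each bin closed on the left, matching the [4000 <= x < 5000] convention (hist() still includes the upper edge of the last bin).

```r
# Hypothetical raw observations, chosen to reproduce the sample counts
raw_values <- rep(c(4200, 6300, 7100), times = c(8, 9, 3))

# Bin into 1000-wide intervals, closed on the left, without drawing a plot
h <- hist(raw_values, breaks = seq(4000, 8000, by = 1000),
          right = FALSE, plot = FALSE)

# hist() returns bin mid-points and integer counts
binned <- data.frame(binMidPts = h$mids, counts = h$counts)
print(binned)
```

The resulting data frame has the same shape as the sample data, so it can be passed to the same geom_col() call.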

Code that almost works:

The following code generates a graph similar to the one I'm seeking, however the y-axis is labelled with decimal values on the breaks (which aren't valid as the data are integer count values).

ggplot(data = sample, aes (x = binMidPts, y = counts)) + geom_col()

Graph produced by this code is: Simple geom_col plot with "incorrect" continuous y-axis

I realise I could hard-code the breaks / labels onto a scale_y_continuous() axis but (a) I'd prefer a flexible solution to apply to many differently sized datasets where the scale isn't known in advance, and (b) I expect there must be a simpler way to generate a basic histogram.

References

I've consulted many Stack Overflow questions, the ggplot2 manual (https://ggplot2.tidyverse.org/reference/scale_discrete.html), the sthda.com examples and various blogs. These tend to address related problems, e.g. using scale_y_continuous, or where count data is not available in the underlying dataset and thus rely on stat_bin() for a transformation.

Any help would be much appreciated! Thank you.

// Update 1 - Extending scale to zero

Future readers of this thread may find it helpful to know that the range of break values formed by base::pretty() does not necessarily extend to zero. Thus, the axis scale may omit values between zero and the lower range of the breaks, as shown here: y axis breaks omitted below the lower range of pretty()

To resolve this, I included '0' in the range() parameter, i.e.:

ggplot(data = sample, aes (x = binMidPts, y = counts)) + geom_col() +
    scale_y_continuous(breaks=round(pretty(range(0,sample$counts))))

which gives the desired full scale on the y-axis, thus:

y axis scale extends to zero
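A variation that avoids referring to sample$counts inside the scale: scale_y_continuous() also accepts a function for breaks, which receives the axis limits. The sketch below uses a hypothetical helper, integer_breaks (not part of ggplot2), that always includes zero and keeps only whole-number candidates from base::pretty(). It is one possible approach, not the only one.

```r
library(ggplot2)

# Hypothetical helper (not part of ggplot2): integer-only breaks that
# always extend down to zero, computed from the axis limits.
integer_breaks <- function(limits) {
  candidates <- pretty(c(0, limits))
  # Keep only candidates that are whole numbers (within a small tolerance)
  candidates[abs(candidates - round(candidates)) < 1e-9]
}

sample <- data.frame(binMidPts = c(4500, 5500, 6500, 7500),
                     counts = c(8, 0, 9, 3))

p <- ggplot(data = sample, aes(x = binMidPts, y = counts)) +
  geom_col() +
  scale_y_continuous(breaks = integer_breaks)
```

Because the helper works from the limits rather than a fixed vector, the same scale can be reused across datasets with very different count ranges.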

  • There doesn’t seem to be a good solution, short of providing manual breaks, since ‘ggplot2’ does not handle integer variables differently from numeric (floating point) variables. – Konrad Rudolph Apr 28 '21 at 07:50

3 Answers

How about:


ggplot(data = sample, aes (x = binMidPts, y = counts)) + geom_col() +
    scale_y_continuous( breaks=round(pretty( range(sample$counts) )) )

This answer suggests pretty_breaks from the scales package. The manual page of pretty_breaks mentions pretty from base. And from there you just have to round it to the nearest integer.
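One caveat worth checking: for small count ranges, pretty() may return fractional candidates, and rounding them can yield repeated values. A small base R sketch, using a hypothetical small_counts vector, shows how to inspect this and drop repeats with unique():

```r
# Hypothetical small dataset where the count range is narrow
small_counts <- c(0, 1, 2)

raw_breaks <- pretty(range(small_counts))  # may contain non-integer values
rounded    <- round(raw_breaks)            # rounding can create repeats
int_breaks <- unique(rounded)              # drop any repeated break values

print(raw_breaks)
print(int_breaks)
```

The int_breaks vector can be passed to scale_y_continuous(breaks = ...) in the same way as in the call above.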

Sirius

The default y-axis breaks are calculated with scales::extended_breaks(). This function factory has a ... argument that passes arguments on to labeling::extended(), which has a Q argument listing what it considers 'nice numbers'. If you omit the 2.5 from the default, you should get integer breaks when the range is 3 or larger.

library(ggplot2)
library(scales)

sample <- data.frame(binMidPts = c(4500,5500,6500,7500), counts = c(8,0,9,3))

ggplot(data = sample, aes (x = binMidPts, y = counts)) + 
  geom_col() +
  scale_y_continuous(
    breaks = extended_breaks(Q = c(1, 5, 2, 4, 3))
  )

Created on 2021-04-28 by the reprex package (v1.0.0)

teunbrand

Or you can calculate the breaks with some rules customized to the dataset you are working with, like this:

library(ggplot2)

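sample <- data.frame(binMidPts = c(4500,5500,6500,7500), counts = c(8,0,9,3))

breaks_min <- 0
breaks_max <- max(sample[["counts"]])
# Assume about 5 breaks is preferable; keep the step a whole number of at least 1
# so that seq() does not fail when the maximum count is small
breaks_bin <- max(1, round((breaks_max - breaks_min) / 5))
custom_breaks <- seq(breaks_min, breaks_max, breaks_bin)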

ggplot(data = sample, aes (x = binMidPts, y = counts)) + 
  geom_col() +
  scale_y_continuous(breaks = custom_breaks, expand = c(0, 0))

Created on 2021-04-28 by the reprex package (v2.0.0)

Sinh Nguyen
  • Hi Sinh Nguyen, thank you very much for your idea. I did consider that, but didn't want to be too strict with the number of breaks for all datasets (counts sometimes range from 0-10, in other cases 0-5000). A small range may suit a small no. of breaks, but with larger ranges a larger no. of breaks may be appropriate / necessary. Appreciate your help tho'. Thanks. – latitude_longitude Apr 28 '21 at 08:20