
I have a ggplot histogram of a random normal variable, `bob_commute_duration_minutes`, with mean 30 and SD 10 (it is part of a Shiny app, if that matters). I'd like to limit the displayed range of the histogram to +/- 3 SD. I have this:

outdf_r() %>% ggplot(aes(x = bob_commute_duration_minutes,
             fill= dplyr::if_else(bob_commute_duration_minutes>=input$commutemean,"Above Average","Below Average"))) +
  geom_histogram(bins=100) + 
 # ggtitle(paste( mean(outdf_r()$bob_commute_duration_minutes), sd(outdf_r()$bob_commute_duration_minutes),sep=","))+
  ggtitle("Bob's Commute Time Histogram (2SD Marked, 3SD Clipped)") +
  guides(fill=guide_legend(title="Commute Above or Below Average"))+
  xlab("Bob's Commute Time")+
  xlim((input$commutemean- 3*input$commutesd),(input$commutemean + 3*input$commutesd)) +
  geom_vline(xintercept=(input$commutemean- 2*input$commutesd),color="black") +
  geom_vline(xintercept=(input$commutemean+ 2*input$commutesd),color="black") 


So far so good; the extreme outliers aren't displayed. I'd also like to edit the x-axis so there are more tick marks. I tried using `scale_x_continuous`, with limits of +/- 3 SD and breaks at every 1 unit, but that caused the outliers to be displayed again. Any suggestions on how to get more detailed tick marks on the x-axis while still hiding values more than 3 SD away from the mean?

Ralph Asher
  • Running `scale_x_continuous` after `xlim` overrides the call to `xlim`. So, remove the `xlim` call and use the `limits` argument in `scale_x_continuous` to set the limits. Also, bear in mind that setting limits with `scale_x_continuous` (or, equivalently, with `xlim`) excludes data outside the limits from calculations of means, regression lines, or any other summary measures. To set limits without excluding data, use `coord_cartesian`. See [here](https://stackoverflow.com/a/32506068/496488) for additional details. – eipi10 Jun 28 '21 at 19:28
  • That did it, thanks @eipi10 – Ralph Asher Jun 28 '21 at 19:53
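
The difference eipi10 describes between scale limits and `coord_cartesian` can be checked by counting how many observations each histogram actually bins. This is a minimal sketch with simulated data; the data frame `df`, the two deliberately extreme values, and the fixed limits of 0 and 60 are assumptions for illustration and are not taken from the original app.

```r
library(ggplot2)

set.seed(1)
# Simulated commute times plus two values guaranteed to lie outside the limits
df <- data.frame(x = c(rnorm(2000, mean = 30, sd = 10), -20, 90))
lims <- c(0, 60)  # mean +/- 3 SD for the nominal parameters (assumed)

# Scale limits: out-of-range values are turned into NA and dropped before binning
p_scale <- ggplot(df, aes(x)) +
  geom_histogram(bins = 100) +
  scale_x_continuous(limits = lims, breaks = seq(0, 60, by = 5))

# Coord limits: every value is binned; only the visible window is restricted
p_coord <- ggplot(df, aes(x)) +
  geom_histogram(bins = 100) +
  scale_x_continuous(breaks = seq(0, 60, by = 5)) +
  coord_cartesian(xlim = lims)

# Total number of observations that ended up in a bin for each plot
n_scale <- sum(suppressWarnings(layer_data(p_scale))$count)
n_coord <- sum(layer_data(p_coord)$count)

cat("binned with scale limits:", n_scale, "\n")
cat("binned with coord limits:", n_coord, "\n")
```

For a plain histogram the two plots look much the same, but the counts show that the first one has discarded data, which matters as soon as a summary statistic or fitted line is added to the plot.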

1 Answer


Here's an approach that defines the breaks and labels in `scale_x_continuous` and sets the visible range with `coord_cartesian`.

library(dplyr); library(ggplot2)

Data prep

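set.seed(42)  # fix the seed so the simulated data are reproducible
outdf_r <- data.frame(bob_commute_duration_minutes = rnorm(2E4, 30, 10))
# Stand-in for the Shiny inputs: a one-row data frame, so input$commutemean still works
input <- outdf_r %>% summarize(commutemean = mean(bob_commute_duration_minutes),
                               commutesd = sd(bob_commute_duration_minutes))
# Visible range: mean +/- 3 SD
x_range <- input$commutemean + input$commutesd * c(-3,3)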

Function to label only the multiples of 5

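# Returns a labelling function that keeps labels at multiples of 5 and blanks the rest.
# as.character() avoids the width padding that format() adds (e.g. " 5").
label_5s <- function() {
  function(x) if_else(x %% 5 == 0, as.character(x), "")
}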
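
To see what a labelling function of this kind hands back to ggplot2, it can be called directly on a vector of breaks. The sketch below uses a hypothetical variant, `label_every()`, which takes the step as an argument and relies on base R's `ifelse()` and `as.character()`; it illustrates the idea and is not the function from the answer.

```r
# Hypothetical helper: build a labeller that shows only multiples of `step`
label_every <- function(step = 5) {
  function(x) ifelse(x %% step == 0, as.character(x), "")
}

# Call the labeller on a few integer breaks, as ggplot2 would
labs <- label_every(5)(0:10)
print(labs)
```

Every break still gets a tick mark; only the breaks whose label is a non-empty string are annotated.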

Code

outdf_r %>% ggplot(aes(x = bob_commute_duration_minutes,
                         fill= dplyr::if_else(bob_commute_duration_minutes>=input$commutemean,"Above Average","Below Average"))) +
  geom_histogram(bins=100) + 
  ggtitle("Bob's Commute Time Histogram (2SD Marked, 3SD Clipped)") +
  guides(fill=guide_legend(title="Commute Above or Below Average"))+
  scale_x_continuous(breaks = 0:60, minor_breaks = NULL, labels = label_5s()) +  # tick every minute, no minor gridlines
  xlab("Bob's Commute Time")+
  coord_cartesian(xlim = x_range, expand = FALSE) +  # zoom to +/- 3 SD without dropping data
  geom_vline(xintercept=(input$commutemean- 2*input$commutesd),color="black") +
  geom_vline(xintercept=(input$commutemean+ 2*input$commutesd),color="black") 

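
In the Shiny app the mean and SD come from user inputs, so hard-coded breaks of `0:60` would not follow them. One possible adaptation is to derive the breaks from the same +/- 3 SD range. This is a sketch only: `sd_breaks()`, `commute_mean`, and `commute_sd` are hypothetical names standing in for the app's `input$commutemean` and `input$commutesd`.

```r
library(ggplot2)

# Hypothetical helper: whole-number breaks covering mean +/- k SDs
sd_breaks <- function(mean, sd, k = 3, by = 1) {
  seq(floor(mean - k * sd), ceiling(mean + k * sd), by = by)
}

commute_mean <- 30  # assumed stand-in for input$commutemean
commute_sd   <- 10  # assumed stand-in for input$commutesd
x_range <- commute_mean + commute_sd * c(-3, 3)
brks <- sd_breaks(commute_mean, commute_sd)

set.seed(1)
df <- data.frame(x = rnorm(2000, commute_mean, commute_sd))

p <- ggplot(df, aes(x)) +
  geom_histogram(bins = 100) +
  # One tick per unit, labelled only at multiples of 5
  scale_x_continuous(
    breaks = brks, minor_breaks = NULL,
    labels = function(b) ifelse(b %% 5 == 0, as.character(b), "")
  ) +
  # Restrict the view to +/- 3 SD without discarding any data
  coord_cartesian(xlim = x_range, expand = FALSE)
```

Inside a `renderPlot()` call, the two stand-in values would presumably be replaced by the reactive inputs so that the ticks move with the sliders.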

Jon Spring