I am trying to make a histogram using ggplot where over 95% of the data is 0 and the rest is between 1 and 55. I do not want to show the 0s on the histogram, but I do want them counted in the total used for the percentages, so that the other percentages remain low. I've tried two approaches, but in both the percentages for the rest of the data are wrong because the 0s aren't included in the calculation.

My first approach was this:

set1 %>% filter(total>0)%>%
  ggplot(aes(x=total, fill=lowcost))+
  geom_histogram(binwidth=1,aes(y = (..count..)/sum(..count..)),col=I("black"))+
  scale_color_grey()+scale_fill_grey(start = .85,
                                     end = .85,) +
  theme_linedraw()+
  guides(fill = "none", cols='none')+
  geom_vline(aes(xintercept=10, size='Low target'),
             color="black", linetype=5)+
  geom_vline(aes(xintercept=50, size='High target'),
             color="black", linetype="dotted")+
  scale_size_manual(values = c(.5, 0.5), guide=guide_legend(title = "Target", override.aes = list(linetype=c(3,5), color=c('black', 'black'))))+
  scale_y_continuous(labels=scales::percent)+
  scale_x_continuous(breaks = c(seq(0,50,10), 55), labels = c(seq(0, 50, 10), '>55'), limits = c(0, 60))+
  facet_grid(cols = vars(lowcost))+
  ggtitle("Ask Set 1 ")+
  theme(plot.title = element_text(hjust = 0.5))+
  xlab("Total donation ($)")+
  ylab("Percent")

My second approach was to keep the 0s but limit the x-axis so that it does not include them. This didn't work either:

set1 %>% 
  ggplot(aes(x=total, fill=lowcost))+
  geom_histogram(binwidth=1,aes(y = (..count..)/sum(..count..)),col=I("black"))+
  scale_color_grey()+scale_fill_grey(start = .85,
                                     end = .85,) +
  theme_linedraw()+
  guides(fill = "none", cols='none')+
  geom_vline(aes(xintercept=10, size='Low target'),
             color="black", linetype=5)+
  geom_vline(aes(xintercept=50, size='High target'),
             color="black", linetype="dotted")+
  scale_size_manual(values = c(.5, 0.5), guide=guide_legend(title = "Target", override.aes = list(linetype=c(3,5), color=c('black', 'black'))))+
  scale_y_continuous(labels=scales::percent)+
  scale_x_continuous(breaks = c(seq(0,50,10), 55), labels = c(seq(0, 50, 10), '>55'), limits = c(0.01, 60))+
  facet_grid(cols = vars(lowcost))+
  ggtitle("Ask Set 1 ")+
  theme(plot.title = element_text(hjust = 0.5))+
  xlab("Total donation ($)")+
  ylab("Percent")

Both result in histograms that look like this. The tallest bar in the left histogram should actually be 1.19%:

[Image: the two faceted histograms]

The percents should be the following in the histogram on the left:

[Image: expected percentages for the left histogram]

The percents should be the following in the histogram on the right:

[Image: expected percentages for the right histogram]

  • (1) The use of vectors in `ggplot(..)` works but to me seems risky and does not enable "normal things". I strongly recommend that you keep data for plotting in a `data.frame`; this would make it `ggplot(mydata, aes(x=total, fill=lowcost))+...`. (2) Once there, you can just subset the data out of the plot, perhaps `ggplot(subset(mydata, total>0), aes(x=total, fill=lowcost))+...`. – r2evans Mar 24 '22 at 16:57
  • @r2evans thanks for pointing that out, the first line of my code got cut off, I just updated it. I tried the suggested code but it didn't fix the problem with the 0s – Erin Morrissey Mar 24 '22 at 17:03
  • Okay, that makes sense. Can you make this question [reproducible](https://stackoverflow.com/q/5963269) using `dput(.)` to provide representative data? You shouldn't share all 18k rows of course, but perhaps you can generate sample data randomly (or deterministically, I don't care) and use that for the question. – r2evans Mar 24 '22 at 17:04
  • Unfortunately, I'm not sure how to do that, my knowledge of R is quite limited (I really scraped this code together) – Erin Morrissey Mar 24 '22 at 17:07
  • Okay, I've reread your question, I think my first comment was misguided. Since you cannot share your current data, perhaps try to do something simple and similar with (for example) `ggplot2::diamonds` or another dataset. – r2evans Mar 24 '22 at 17:15
  • Perhaps all you need to do is remove `filter(..)`, giving all data to `ggplot(..)`, but then add `... + coord_cartesian(xlim=c(1, NA))`. This is doing "clipping" of the plot, so all stats should be unchanged. – r2evans Mar 24 '22 at 17:18

2 Answers

I think you can do what you want using "clipping" with coord_cartesian. Try this (untested):

set1 %>%
  # filter(total>0) %>%                   # comment this out, do not filter
  ggplot(aes(x=total, fill=lowcost)) +
  coord_cartesian(xlim = c(1, NA)) +      # start at 1, extend to the normal limit
  geom_histogram(binwidth=1, aes(y = (..count..)/sum(..count..)), col=I("black")) +
  ...                                     # rest unchanged
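
To see why clipping behaves differently from filtering or from scale limits, here is a minimal sketch on made-up data. The `demo` data frame and the helpers `make_plot()` and `bar_at_one()` are hypothetical names introduced only for this illustration, and it assumes ggplot2 3.3.0 or later for `after_stat()`. It pulls the computed bar heights out of `ggplot_build()` rather than reading them off the plot.

```r
library(ggplot2)

# Hypothetical stand-in for set1: 95 zeros plus 5 non-zero donations
demo <- data.frame(total = c(rep(0, 95), 1, 1, 2, 5, 10))

# Same histogram recipe as above, stripped of theming
make_plot <- function(data) {
  ggplot(data, aes(x = total)) +
    geom_histogram(binwidth = 1, aes(y = after_stat(count / sum(count))))
}

# Height of the bar whose centre is closest to total == 1
bar_at_one <- function(plot) {
  bars <- ggplot_build(plot)$data[[1]]
  bars$y[which.min(abs(bars$x - 1))]
}

# Zooming keeps every row, so the zeros stay in the denominator
zoomed <- bar_at_one(make_plot(demo) + coord_cartesian(xlim = c(0.5, NA)))

# Filtering first removes the zeros from the denominator
filtered <- bar_at_one(make_plot(subset(demo, total > 0)))

# Scale limits turn out-of-range values into NA before binning,
# so the zeros are dropped here too (ggplot2 warns about removed rows)
limited <- suppressWarnings(
  bar_at_one(make_plot(demo) + scale_x_continuous(limits = c(0.5, NA)))
)

c(zoomed = zoomed, filtered = filtered, limited = limited)
```

With this toy data the zoomed bar should come out at 2/100 = 2%, while the filtered version gives 2/5 = 40%. The scale-limit version likewise loses the zeros, which is consistent with the second approach in the question not working.
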
r2evans
  • Awesome, this worked, thanks so much! It did raise another issue that I hadn't realized was present in the existing code. The percentages in both the left and right charts are calculated out of the entire dataset, whereas the left should show percentages out of only one level of the `lowcost` variable and the right should show percentages out of the other level. Any idea how to address this? – Erin Morrissey Mar 24 '22 at 18:31
  • If I understand correctly, you want to report the percentage per group instead of overall. I suggest that's a different question. When you post that, I strongly suggest you shift to a dataset where the question is fully reproducible. I know that's extra work for you, but it seems unlikely you'll be able to share `set1` with sufficient variability to demonstrate your code and issue. Also, in basic "how do I" questions, most of the theming of `ggplot2` plots is an unnecessary distraction; it really helps when the code is truly *minimal*. – r2evans Mar 24 '22 at 18:34

Perhaps try something like this:

# Test data + expected outcome
library(dplyr)    # provides %>%, count(), mutate(); also re-exports tibble()
library(ggplot2)
set1 <- tibble(total=c(rep(0,10), rep(1,5), rep(2,5)))
set1 %>% count(total) %>% mutate(percent = n/sum(n))

[Image: output table of counts and percentages]

# First, count the percentage and store it in a temporary variable
# Then, use the percentage variable with "identity" option for the histogram
# You can then either filter out the total first, or change the limit
set1 %>% 
    count(total) %>% 
    mutate(percent = n/sum(n)) %>%
    filter(total>0) %>%
    ggplot(aes(x=total,y=percent)) + 
    geom_histogram(stat="identity") +
    scale_x_continuous(limits = c(0, 3)) +
    scale_y_continuous(labels=scales::percent) +
    ylab("Percent")

[Image: resulting plot]
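
The same pre-computation can be extended to the per-group percentages raised in the comments on the other answer. This is only a sketch on invented data: the `demo` tibble and its `lowcost` levels ("no", "yes") are assumptions, not the real `set1`. The idea is to compute each percentage within its `lowcost` group while the zeros are still present, and only then drop the zero rows.

```r
library(dplyr)
library(ggplot2)

# Hypothetical data: two levels of lowcost, most totals are 0
demo <- tibble(
  lowcost = rep(c("no", "yes"), each = 20),
  total   = c(rep(0, 16), 1, 1, 2, 3,   # "no":  4 of 20 rows are non-zero
              rep(0, 18), 1, 5)         # "yes": 2 of 20 rows are non-zero
)

pct <- demo %>%
  count(lowcost, total) %>%
  group_by(lowcost) %>%
  mutate(percent = n / sum(n)) %>%  # denominator is the group total, zeros included
  ungroup() %>%
  filter(total > 0)                 # hide the zero bar only after computing percentages

p <- ggplot(pct, aes(x = total, y = percent)) +
  geom_col() +
  scale_y_continuous(labels = scales::percent) +
  facet_grid(cols = vars(lowcost)) +
  ylab("Percent")
p
```

Because the percentages are computed before plotting, each facet shows shares of its own group, and the bars in a facet sum to less than 100% by exactly the share of zeros in that group.
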