I have a set of times that I would like to plot on a histogram. Toy example:

df <- data.frame(time = c(1,2,2,3,4,5,5,5,6,7,7,7,9,9, ">10"))

The problem is that one value is ">10" and refers to the number of times that more than 10 seconds were observed. The other time points are all numbers referring to the actual time. Now, I would like to create a histogram that treats all numbers as numeric and combines them in bins when appropriate, while plotting the counts of the ">10" at the side of the distribution, but not in a separate plot. I have tried to call geom_histogram twice, once with the continuous data and once with the discrete data in a separate column but that gives me the following error:

Error: Discrete value supplied to continuous scale

Happy to hear suggestions!

Felix
  • Do you need `df %>% count(time) %>% ggplot(aes(x = time, y = n)) + geom_col()`? – akrun Jul 07 '20 at 19:34
  • But I would like to use bins as in a normal histogram. Or do you suggest merging the values into bins beforehand and then plotting them with geom_col like this? Sounds doable, but relatively impractical. Is there another way to still take advantage of geom_histogram? – Felix Jul 07 '20 at 19:40
  • You may replace the character value with a numeric value, convert it to numeric. But, it is not entirely clear to me – akrun Jul 07 '20 at 19:41
  • Thanks for the response! Let's say I convert >10 to numeric: Then it would arbitrarily be part of the last bin in the histogram, which I don't want. To say it in other words: I would like to have a histogram with all values 0-10 at binwidth 2. On the same plot, right next to that, I would like to add one barplot that displays the count of values >10. – Felix Jul 07 '20 at 19:52
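
One way to read the suggestion in these comments is to bin the numeric values yourself and then plot the resulting counts with geom_col(). The following is a minimal sketch of that idea, not code posted in the thread: the breaks (width 2, from 0 to 10) are taken from Felix's last comment, and the names is_overflow, bins and counts are made up for illustration.

```r
library(ggplot2)

df <- data.frame(time = c(1,2,2,3,4,5,5,5,6,7,7,7,9,9, ">10"))

# Separate the ">10" marker from the values that are really numeric
is_overflow <- df$time == ">10"
num <- as.numeric(df$time[!is_overflow])

# Bin the numeric values by hand (assumed breaks: 0, 2, ..., 10)
bins <- cut(num, breaks = seq(0, 10, by = 2), include.lowest = TRUE)

# One row per bin, plus one extra row for the ">10" count
bin_levels <- c(levels(bins), ">10")
counts <- data.frame(
  bin = factor(bin_levels, levels = bin_levels),
  n   = c(as.vector(table(bins)), sum(is_overflow))
)

p <- ggplot(counts, aes(x = bin, y = n)) +
  geom_col()
p
```

The drawback Felix mentions still applies: the binning is fixed before plotting, so changing the bin width means recomputing counts rather than changing one argument of geom_histogram.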

2 Answers

Perhaps this is what you are looking for:

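library(dplyr)
library(ggplot2)

# toy data: 50 integer values between 1 and 12
df1 <- data.frame(x = sample(1:12, 50, replace = TRUE))

# counts for each value up to 10
df2 <- df1 %>% group_by(x) %>%
        dplyr::summarise(y = n()) %>% subset(x < 11)

# one pooled count for everything above 10, placed at x = 11
df3 <- subset(df1, x > 10) %>% dplyr::summarise(y = n()) %>% mutate(x = 11)

df <- bind_rows(df2, df3)
label <- ifelse(df$x < 11, as.character(df$x), ">10")

# bars on a continuous axis; the tick at 11 is relabelled as ">10"
p <- ggplot(df, aes(x = x, y = y, color = x, fill = x)) +
  geom_col() +
  scale_x_continuous(breaks = df$x, labels = label)
p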

and you get the following output:

[Image: output of the code above]

Please note that, because the data are a random sample, some values may not occur at all, in which case the corresponding bars are missing.

YBS
  • Thank you @YBS, that is almost the implementation that I planned to do. A few changes that I made in the end: (1) Define >10 as '11' to plot it on a continuous scale. (2) Use geom_histogram. (3) Rename the label '11' to '>10' with scale_x_continuous. Obviously, one has to be careful when assigning the value in (1), but it works very well in that particular use-case. – Felix Jul 08 '20 at 14:03
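
The three steps in this comment could look roughly like the following. This is only a sketch under assumptions: the stand-in value 11, the bin breaks (width 2, from 0 to 12) and the names overflow_pos and time_num are illustrative choices, not Felix's actual code.

```r
library(ggplot2)

df <- data.frame(time = c(1,2,2,3,4,5,5,5,6,7,7,7,9,9, ">10"))

# (1) Assumed stand-in position for ">10"; it must lie outside the real data range
overflow_pos <- 11
time_num <- ifelse(df$time == ">10",
                   overflow_pos,
                   suppressWarnings(as.numeric(df$time)))
plot_df <- data.frame(time = time_num)

# (3) Tick positions and labels: the tick at 11 is shown as ">10"
tick_pos <- c(seq(0, 10, by = 2), overflow_pos)
tick_lab <- c(seq(0, 10, by = 2), ">10")

# (2) Histogram with explicit breaks, so the stand-in gets its own bin (10, 12]
p <- ggplot(plot_df, aes(x = time)) +
  geom_histogram(breaks = seq(0, 12, by = 2), color = "gray25") +
  scale_x_continuous(breaks = tick_pos, labels = tick_lab)
p
```

Passing explicit breaks keeps the stand-in value from sharing a bin with real observations, which is the caveat mentioned in step (1).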

Here's a somewhat involved solution, but I believe it best answers your question: you want to place a bar representing the ">10" values (or any non-numeric values) next to a typical histogram. Critically, you want to keep the binning that a histogram provides, which means you are not looking to simply switch to a discrete scale and imitate a histogram with an ordinary barplot.

The Data

Since you want to retain histogram features, I'm going to use an example dataset that is a bit more involved than the one you gave us. I'm just going to draw 100 values from a uniform distribution between 0 and 10, with 20 ">10" values thrown in.

set.seed(123)
df<- data.frame(time=c(runif(100,0,10), rep(">10",20)))

As prepared, df$time is a character vector, but a histogram needs it to be numeric. We simply force it to be numeric and accept that the ">10" values are coerced to NA (R issues a warning about this). This is fine, since in the end we just count those NA values and represent them with a bar. While I'm at it, I create a subset of df for the bar representing our NAs (">10") using the count() function, which returns a data frame with one row and one column: df_na$n = 20 in this case.

library(dplyr)
df$time <- as.numeric(df$time)  #force numeric and get NA for everything else
df_na <- count(subset(df, is.na(time)))
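
If you would rather not rely on the coercion turning ">10" into NA, a small variation is to flag the ">10" entries before converting. This is a sketch of an alternative, not part of the approach above; the vector raw and the names is_overflow, time_num and n_overflow are hypothetical.

```r
# Hypothetical raw input mixing numeric strings and the ">10" marker
raw <- c("1.5", "7.2", ">10", "3", ">10")

# Flag the overflow entries explicitly instead of inferring them from NA
is_overflow <- raw == ">10"

# Convert only the entries that are expected to be numeric
time_num <- rep(NA_real_, length(raw))
time_num[!is_overflow] <- as.numeric(raw[!is_overflow])

# Count for the separate ">10" bar
n_overflow <- sum(is_overflow)
```

With this variation, a warning from as.numeric() points at a genuinely malformed value rather than at the expected ">10" entries.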

The Plot(s)

For the actual plot, you are asking for a combination of (1) a histogram and (2) a barplot. These are not the same kind of plot and, more importantly, they cannot share the same x axis: a histogram needs a continuous axis, and neither NA nor ">10" is a continuous value. The solution here is to make two separate plots and then combine them with plot_grid() from the cowplot package.

The histogram is created quite easily. I'm saving the number of bins for demonstration purposes later. Here's the basic plot:

library(ggplot2)

bin_num <- 12  # using this later

p1 <- ggplot(df, aes(x=time)) + theme_classic() +
  geom_histogram(color='gray25', fill='blue', alpha=0.3, bins=bin_num)

[Image: histogram p1]

Thanks to the subsetting previously, the barplot for the NA values is easy too:

p2 <- ggplot(df_na, aes(x=">10", y=n)) + theme_classic() +
  geom_col(color='gray25', fill='red', alpha=0.3)

[Image: single-bar plot p2]

Yikes! That looks horrible, but have patience.

Stitching them together

You can simply run plot_grid(p1, p2) and you get something workable... but it leaves quite a lot to be desired:

[Image: result of plot_grid(p1, p2)]

There are problems here. I'll enumerate them, then show you the final code for how I address them:

  1. Need to remove some elements from the NA barplot. Namely, the y axis entirely and the title for x axis (but it can't be NULL or the x axes won't line up properly). These are theme() elements that are easily removed via ggplot.

  2. The NA barplot is taking up WAY too much room. Need to cut the width down. We address this by accessing the rel_widths= argument of plot_grid(). Easy peasy.

  3. How do we know how to set the y scale upper limit? This is a bit more involved, since it depends on the computed ..count.. values for p1 as well as on the number of NA values. You can access the maximum count for a histogram using ggplot_build(), which is part of ggplot2.
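
To make point 3 concrete, here is a minimal sketch of what ggplot_build() returns for a histogram. The data frame d, the plot p_demo and the name layer1 are hypothetical and only serve as an illustration.

```r
library(ggplot2)

set.seed(123)
d <- data.frame(time = runif(100, 0, 10))

p_demo <- ggplot(d, aes(x = time)) +
  geom_histogram(bins = 12)

# ggplot_build() computes the stats without drawing; $data holds one
# data frame per layer, and the histogram layer has a `count` column
layer1 <- ggplot_build(p_demo)$data[[1]]

# Tallest bar of the histogram, usable as a shared upper y limit
max_hist <- max(layer1$count)
max_hist
```

Taking the larger of this value and the ">10" count gives a single upper limit that can be applied to both plots, so their y axes line up.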

So, the final code requires the creation of the basic p1 and p2 plots, then adds to them in order to fix the limits. I'm also adding an annotation for number of bins to p1 so that we can track how well the upper limit setting works. Here's the code and some example plots where bin_num is set at 12 and 5, respectively:

library(ggplot2)
library(cowplot)  # provides plot_grid()

# basic plots
p1 <- ggplot(df, aes(x=time)) + theme_classic() +
  geom_histogram(color='gray25', fill='blue', alpha=0.3, bins=bin_num)

p2 <- ggplot(df_na, aes(x=">10", y=n)) + theme_classic() +
  geom_col(color='gray25', fill='red', alpha=0.3) +
  labs(x="") +  theme(axis.line.y=element_blank(), axis.text.y=element_blank(),
    axis.title.y=element_blank(), axis.ticks.y=element_blank()
  ) +
  scale_x_discrete(expand=expansion(add=1))

#set upper y scale limit
max_count <- max(c(max(ggplot_build(p1)$data[[1]]$count), df_na$n))

# fix limits for plots
p1 <- p1 + scale_y_continuous(limits=c(0,max_count), expand=expansion(mult=c(0,0.15))) +
  annotate('text', x=0, y=max_count, label=paste('Bins:', bin_num))  # for demo purposes
p2 <- p2 + scale_y_continuous(limits=c(0,max_count), expand=expansion(mult=c(0,0.15)))

plot_grid(p1, p2, rel_widths=c(1,0.2))

[Image: combined plot with bin_num = 12]

[Image: combined plot with bin_num = 5]

So, fixing the upper limit works. You can go much further with positioning and the other plot_grid() options, but I think it works pretty well this way.

chemdork123
  • Thank you for the great answer. I accepted this answer since it is more general and covers more cases. In the end, I decided to use a modified implementation of @YBS's answer. – Felix Jul 08 '20 at 14:08
  • That is definitely simpler, but if you had the case where you needed to use the binning provided by `stat_bin`, then you'd have to resort to something like this. – chemdork123 Jul 08 '20 at 14:40