3

I would like to create a histogram plot comparing three groups. However, I'd like to normalize each histogram by the total number of counts within each group, not by the total number of counts. Here is the code that I have.

library(ggplot2)
library(reshape2)
# Creates dataset
set.seed(9)
df<- data.frame(values = c(runif(400,20,50),runif(300,40,80),runif(600,0,30)),labels = c(rep("med",400),rep("high",300),rep("low",600)))

levs <- c("low", "med", "high")
df$labels <- factor(df$labels, levels = levs)

ggplot(df, aes(x=values, fill=labels)) + 
    geom_histogram(aes(y=..density..), 
                   breaks= seq(0, 80, by = 2),
                   alpha=0.2, 
                   position="identity")

Which generates a histogram which appears to be normalized by density. enter image description here

However, I decided to cross check this density plot against my manual validation of that density. To do that I used the below code:

# Separates the low medium and high groups
df1 <- df[df$labels == "low",]
df2 <- df[df$labels == "med",]
df3 <- df[df$labels == "high",]

# creates histogram for each group that is normalized by the total number of counts
hist_temp <- hist(df1$values, breaks=seq(0,80, by=2))
    tdf <- data.frame(hist_temp$breaks[2:length(hist_temp$breaks)],hist_temp$counts)
    colnames(tdf) <- c("bins","counts")
    tdf$norm <- tdf$counts/(sum(tdf$counts))
        low1 <- tdf

hist_temp <- hist(df2$values, breaks=seq(0,80, by=2))
    tdf <- data.frame(hist_temp$breaks[2:length(hist_temp$breaks)],hist_temp$counts)
    colnames(tdf) <- c("bins","counts")
    tdf$norm <- tdf$counts/(sum(tdf$counts))
        med1 <- tdf

hist_temp <- hist(df3$values, breaks=seq(0,80, by=2))
    tdf <- data.frame(hist_temp$breaks[2:length(hist_temp$breaks)],hist_temp$counts)
    colnames(tdf) <- c("bins","counts")
    tdf$norm <- tdf$counts/(sum(tdf$counts))
        high1 <- tdf

# Combines normalized histograms for each data frame and melts them into a single vector for plotting
Tdata <- data.frame(low1$bins,low1$norm,med1$norm,high1$norm)
    colnames(Tdata) <- c("bin","low", "med", "high")
    Tdata<- melt(Tdata,id = "bin")

levs <- c("low", "med", "high")
Tdata$variable <- factor(Tdata$variable, levels = levs)

# Plot the data
ggplot(Tdata, aes(group=variable, colour= variable)) + 
    geom_line(aes(x = bin, y = value))

Which generates: enter image description here

As you can see those are quite different from each other and I can't figure out why. The Y axis should be the same for both of them but it's not. So, assuming I didn't do some stupid math error, I believe I want the histogram to look like the line plot and I can't figure out a way to make that happen. Any help is appreciated and thank you in advance.


Edited to add further examples of what doesn't work:

I have also tried using the ..count../(sum(..count..)) approach with this code:

# Histogram where each histogram is divided by the total count of all groups    
    ggplot(df, aes(x=values, fill=labels, group=labels)) + 
        geom_histogram(aes(y=(..count../sum(..count..))), 
                       breaks= seq(0, 80, by = 2),
                       alpha=0.2, 
                       position="identity")

with these results: enter image description here

Which just normalizes to the total count of all histograms. This also does not reflect what I see in the line plot. Also, I've tried substituting ..ncount.. for ..count.. (in the numerator, denominator, and numerator and denominator) and that also does not recreate the results shown in the line graph.

Additionally, I've tried using "position=stack" rather than identity using the below code:

    ggplot(df, aes(x=values, fill=labels, group=labels)) + 
        geom_histogram(aes(y=..density..), 
                       breaks= seq(0, 80, by = 2),
                       alpha=0.2, 
                       position="stack")

and got this result: enter image description here

Which also does not reflect the values shown in the line graph.


Progress made! Using the approach outlined at this post by Joran I can now generate the histogram that is the same as the line graph. Below is the code:

# Plot where each histogram is normalized by its own counts.  
ggplot(df,aes(x=values, fill=labels, group=labels)) + 
    geom_histogram(data=subset(df, labels == 'high'),
                   aes(y=(..count../sum(..count..))), 
                   breaks= seq(0, 80, by = 2),
                   alpha = 0.2) + 
    geom_histogram(data=subset(df, labels == 'med'),
                   aes(y=(..count../sum(..count..))), 
                   breaks= seq(0, 80, by = 2),
                   alpha = 0.2) +
    geom_histogram(data=subset(df, labels == 'low'),
                   aes(y=(..count../sum(..count..))), 
                   breaks= seq(0, 80, by = 2),
                   alpha = 0.2) +
    scale_fill_manual(values = c("blue","red","green"))

Which produces this graph: enter image description here

However, I am STILL having trouble re-ordering the data so that the legend reads "low" then "med" then "high", instead of alphabetical order. I've already set the levels of the factors. (See first block of code). Any thoughts?

Nathan
  • 323
  • 3
  • 13
  • In the first chunk you used y = ..density.., which I am guessing accounts for the probability density. Try to add group = labels and use y = ..count../sum(..count..) instead. – Vitor Bianchi Lanzetta Feb 22 '18 at 08:51
  • Thank you for the reply. Unfortunately, that doesn't seem to create the line plot I'm looking for. I've added the results after the line above to show what that produces. Basically, ..count../sum(..count..) only seems to work if you have a single histogram. When you have multiples the sum(..count..) divides by the total sum of all histograms and gives you fractions that are too low. – Nathan Feb 22 '18 at 15:51
  • If that is the case I have an solution that lacks elegance but shall work. Filter your data frame per label and use 3 layers instead of 1. – Vitor Bianchi Lanzetta Feb 23 '18 at 16:23
  • @VitorBianchiLanzetta thanks for the comment. that does seem to work. I'd still like something a little more... elegant if it were possible. It doesn't seem like too strange of a plot. Maybe I'll spend the time to edit ggplot2 source and submit it to the developers for them to include. – Nathan Feb 27 '18 at 04:47
  • I got a feeling that maybe `foreach` is the way to go but I don't have the time to tackle it right now. You should try it :D – Vitor Bianchi Lanzetta Mar 03 '18 at 01:55
  • I struggled with this for quite some time - finally someone who gets it and stops simply suggesting to use `..density..`! Coming from a Python world, it really shouldn't be this hard. Thanks for your solution! – Q-man Jan 07 '22 at 21:46

1 Answers1

0

To use counts for each category, maybe position="stack"?

ggplot(df, aes(x=values, fill=labels)) + 
  geom_histogram(aes(y=..density..), 
                 breaks= seq(0, 80, by = 2),
                 alpha=0.4, 
                 position="stack") +
  geom_density(alpha=.2, position="stack")

It gives me this distribution, but still seems different than your second plot.

Bruno Pinheiro
  • 964
  • 8
  • 20
  • Thanks for the reply. I've added the code to my post as another method that doesn't seem to work. After sleeping on it and looking at the line plot code again I still don't see what I could be doing wrong. It also seems crazy that I can produce half a dozen different histograms, all normalized in different ways, and none of them are the way that makes the most sense to me. – Nathan Feb 22 '18 at 15:48