I have a mixture of two normal distributions (the components overlap), and I know which component each sample comes from. I want to plot a histogram of the samples on a density scale, together with the mixture density.

Let's head to the code (seg 1):

library(mixtools)

# two components
set.seed(1)    # for reproducible example
b1 <- rnorm(900000, mean=8, sd=2) # samples
b2 <- rnorm(100000, mean=17, sd=2)

# densities corresponding to samples
d = dnorm(c(b1, b2), mean = 8, sd = 2)*.9 + dnorm(c(b1, b2), mean = 17, sd = 2)*.1 

# ground truth
b <- data.frame(ss=c(b1,b2), dd=d, gg=factor(c(rep(1, length(b1)), rep(2, length(b2))))) 

# sample from mixed distribution
c <- b[sample(nrow(b), 500000),] 

library(ggplot2)
ggplot(data = c, aes(x = ss)) +
  geom_histogram(aes(y = stat(density)), binwidth = .5, alpha = .3, position="identity") +
  geom_line(data = c, aes(x = ss, y = dd), color = "red", inherit.aes = FALSE)

This result is fine.

But I want to fill the bars according to each sample's group, so I changed the code (seg 2):

ggplot(data=c, aes(x=ss)) +
  geom_histogram(aes(y=stat(density), fill=gg, color=gg), 
                 binwidth=.5, alpha=.3, position="identity") +
  geom_line(data=c, aes(x=ss, y=dd), color="red", inherit.aes=FALSE)

The result is wrong. R calculates the density of the two groups separately, so each group is normalized on its own and the two parts look about the same height.

Then I found an approach that computes the bar heights by hand (seg 3):
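To illustrate what "separately" means here, a minimal base-R sketch. It assumes smaller sample sizes than above (chosen only for speed), and `x1`, `x2`, `h1`, `h2`, `bw` are names made up for this illustration. It bins each group on its own and shows that each group's density integrates to 1, which is why the smaller group is not drawn smaller.

```r
set.seed(1)
bw <- 0.5                               # same bin width as in the plots
x1 <- rnorm(9000, mean = 8, sd = 2)     # 90% component (hypothetical smaller sample)
x2 <- rnorm(1000, mean = 17, sd = 2)    # 10% component

# Common break points covering both groups
breaks <- seq(floor(min(x1, x2)), ceiling(max(x1, x2)), by = bw)

# Bin each group independently, without plotting
h1 <- hist(x1, breaks = breaks, plot = FALSE)
h2 <- hist(x2, breaks = breaks, plot = FALSE)

# Each group is normalized on its own, so each area is 1
area1 <- sum(h1$density * bw)
area2 <- sum(h2$density * bw)
print(c(area1, area2))

# The peaks end up at a similar height although group 2 has 9 times fewer samples
print(c(max(h1$density), max(h2$density)))
```

This is only meant to mimic the per-group normalization; it is not how ggplot2 is implemented internally.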

breaks = seq(min(c$ss), max(c$ss), .5) # form cut points
bins1 = cut(with(c, ss[gg==1]), breaks) # form intervals by cutting
bins2 = cut(with(c, ss[gg==2]), breaks)
cnt1 = sapply(split(with(c, ss[gg==1]), bins1), length) # assign points to its interval
cnt2 = sapply(split(with(c, ss[gg==2]), bins2), length)
h = data.frame(
  x = head(breaks, -1)+.25,
  dens1 = cnt1/sum(cnt1,cnt2), # height of density bar
  dens2 = cnt2/sum(cnt1,cnt2)
  # weight = sapply(split(samples.mixgamma$samples, bins), sum)
)
ggplot(h) +
  geom_bar(aes(x, dens1), fill="red", alpha = .3, stat="identity") +
  geom_bar(aes(x, dens2), fill="blue", alpha = .3, stat="identity") +
  geom_line(data=c, aes(x=ss, y=dd), color="red", inherit.aes=FALSE)

Alternatively, I set y=stat(count)/sum(stat(count)) (seg 4):

ggplot(data=c, aes(x=ss)) +
  geom_histogram(aes(y=stat(count)/sum(stat(count)), fill=gg, color=gg), 
                 binwidth=.5, alpha=.3, position="identity") +
  geom_line(data=c, aes(x=ss, y=dd), color="red", inherit.aes=FALSE)

Both give the same wrong result: all the bars are about half as tall as in seg 1.

So how can I fill the two groups with different colors as in seg 2, keep the correct proportions as in seg 1, and avoid the mistake in seg 3 and seg 4?

Many thanks!

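The solution: the probability density should be calculated as `y=stat(count)/.5/sum(stat(count))`, that is, each bin's share of all samples divided by the bin width (.5). I had only done the normalization and had not divided the probability mass by its volume (the bin width). With a bin width of .5, the plain proportions are exactly half the density, which is why the bars in seg 3 and seg 4 were half as tall as in seg 1. So seg 3, and answers that use the same normalization, need to be modified in the same way.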
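A corrected version of seg 4 could look like the sketch below. It is a sketch under a few assumptions: ggplot2 3.3.0 or later (for `after_stat()`, the newer spelling of `stat()`), the `width` variable computed by the binning stat in place of the hard-coded .5, and a smaller sample than in the question to keep it fast. The names `dat`, `bw` and `p` are made up for this example.

```r
library(ggplot2)

set.seed(1)
# Smaller sample than in the question, only to keep the sketch fast
b1 <- rnorm(90000, mean = 8, sd = 2)
b2 <- rnorm(10000, mean = 17, sd = 2)

dat <- data.frame(
  ss = c(b1, b2),
  gg = factor(rep(c(1, 2), times = c(length(b1), length(b2))))
)
# Mixture density at each sample point (weights .9 and .1, as in the question)
dat$dd <- 0.9 * dnorm(dat$ss, mean = 8, sd = 2) +
  0.1 * dnorm(dat$ss, mean = 17, sd = 2)

bw <- 0.5  # bin width

p <- ggplot(dat, aes(x = ss)) +
  geom_histogram(
    # share of ALL samples falling in the bin, divided by the bin width
    aes(y = after_stat(count / sum(count) / width), fill = gg, color = gg),
    binwidth = bw, alpha = 0.3, position = "identity"
  ) +
  geom_line(aes(y = dd), color = "red")

print(p)
```

Here `sum(count)` runs over the bins of both groups, so the two histograms together have area 1 and each group keeps its share of the mixture. For seg 3 the same correction would be to divide `dens1` and `dens2` by the bin width.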

  • You may want to try applying the solution [here](https://stackoverflow.com/questions/37404002/geom-density-to-match-geom-histogram-binwitdh) to seg 4. – Z.Lin Sep 06 '18 at 08:14
  • Thanks, I think I found the answer. The density was calculated incorrectly. It should be `y=stat(count)/.5/sum(stat(count))` – Joseph WANG Sep 06 '18 at 10:20
