2

This is a follow-up to my previous question Integrating ggplot2 with user-defined stat_function(), which I've answered myself yesterday. My current problem is that, in the following reproducible example, lines, which are supposed to plot components of the data values' mixture distribution, neither appear in the expected places, nor they're of expected shape, as shown below (see the red lines at y=0 in the second figure).

enter image description here

enter image description here

Complete reproducible example:

library(ggplot2)
library(scales)
library(RColorBrewer)
library(mixtools)

NUM_COMPONENTS <- 2

set.seed(12345) # for reproducibility

data(diamonds, package='ggplot2')  # use built-in data
myData <- diamonds$price

# extract 'k' components from mixed distribution 'data'
mix.info <- normalmixEM(myData, k = NUM_COMPONENTS,
                        maxit = 100, epsilon = 0.01)
summary(mix.info)

numComponents <- length(mix.info$sigma)
message("Extracted number of component distributions: ",
        numComponents)

calc.components <- function(x, mix, comp.number) {

  mix$lambda[comp.number] *
    dnorm(x, mean = mix$mu[comp.number], sd = mix$sigma[comp.number])
}

g <- ggplot(data.frame(x = myData)) +
  scale_fill_continuous("Count", low="#56B1F7", high="#132B43") + 
  scale_x_log10("Diamond Price [log10]",
                breaks = trans_breaks("log10", function(x) 10^x),
                labels = prettyNum) +
  scale_y_continuous("Count") +
  geom_histogram(aes(x = myData, fill = 0.01 * ..density..),
                 binwidth = 0.01)
print(g)

# we could select needed number of colors randomly:
#DISTRIB_COLORS <- sample(colors(), numComponents)

# or, better, use a palette with more color differentiation:
DISTRIB_COLORS <- brewer.pal(numComponents, "Set1")

distComps <- lapply(seq(numComponents), function(i)
  stat_function(fun = calc.components,
                arg = list(mix = mix.info, comp.number = i),
                geom = "line", # use alpha=.5 for "polygon"
                size = 1,
                color = "red")) # DISTRIB_COLORS[i]
print(g + distComps)

UPDATE: Just a quick note on my efforts. I have additionally tried several other options, including converting the plot's x-axis scale to normal and requesting original data values' log transformation in the histogram part, like this: geom_histogram(aes(x = log10(data), fill = ..count..), binwidth = 0.01), but the end result still remains the same. In regard to my first comment, I realized that the transformation I have mentioned is not needed as long as I'm using reference to the ..count.. object.

UPDATE 2: Changed color of line, produced by stat_function(), to red, to clarify the problem.

Community
  • 1
  • 1
Aleksandr Blekh
  • 2,462
  • 4
  • 32
  • 64
  • Just realized that, for both this and the previous questions, I probably need to **multiply** component distribution *data values* to the *total number of elements* in each component distribution (they're equal in our case) in order to move from *density distributions* to *count distributions*. If it makes sense, then how should I do it, using `stat_function()`? I guess, by adding a multiplier as a corresponding parameter to the `calc.components` function and a corresponding argument to the `stat_function`'s `arg` list. – Aleksandr Blekh Sep 01 '14 at 05:41
  • 2
    I downvoted this question because it is too long-winded and the bold face decreases readability. Please make your questions more to the point. Also, you acknowledge that we want minimal reproducible examples. Please try to create one. – Roland Sep 01 '14 at 07:31
  • 1
    @Roland: I have no problem with downvoting, as long as it's substantiated, like you just did. Sorry about bold font - I was trying to emphasize important elements/points. Will limit its use in the future and will try to provide more compact questions. In regard to reproducible example, I've just created one and will update my question with it shortly. Thank you for your help! – Aleksandr Blekh Sep 01 '14 at 07:38
  • In addition to what @Roland says, you should be careful with zeroes and log-axis combined, maybe that's what causes your problems in the first place. – tonytonov Sep 01 '14 at 07:39
  • 1
    @tonytonov: Thank you, will keep this in mind. Updating my question with reproducible example within next 5 minutes... – Aleksandr Blekh Sep 01 '14 at 07:43
  • 1
    Thanks for editing. Much better now. – Roland Sep 01 '14 at 09:48
  • Just a quick note on my efforts. I have additionally tried several options, including converting the plot's x-axis scale to normal and requesting original data values' log transformation in the histogram part, like this: `geom_histogram(aes(x = log10(data), fill = ..count..), binwidth = 0.01)`, but the end result remains the same. In regard to my first comment, I figured that the transformation I've mentioned is not needed as long as I'm using reference to the `..count..` object. – Aleksandr Blekh Sep 01 '14 at 14:27
  • In trying to figure out this issue, I ran across this statement about `ggplot2`: "Any aesthetic defined in a base layer by ggplot() is passed on to all subsequent layers." Could someone clarify that the same applies not only to aesthetic, but to a **scale**, defined in a *base* layer? Thanks! – Aleksandr Blekh Sep 02 '14 at 16:44

1 Answers1

3

Finally, I have figured out the issues, removed my previous answer and I'm providing my latest solution below (the only thing I haven't solved is legend panel for components - it doesn't appear for some reason, but for an EDA to demonstrate the presence of mixture distribution I think that it is good enough). The complete reproducible solution follows. Thanks to everybody on SO who helped w/this directly or indirectly.

library(ggplot2)
library(scales)
library(RColorBrewer)
library(mixtools)

NUM_COMPONENTS <- 2

set.seed(12345) # for reproducibility

data(diamonds, package='ggplot2')  # use built-in data
myData <- diamonds$price


calc.components <- function(x, mix, comp.number) {

  mix$lambda[comp.number] *
    dnorm(x, mean = mix$mu[comp.number], sd = mix$sigma[comp.number])
}


overlayHistDensity <- function(data, calc.comp.fun) {

  # extract 'k' components from mixed distribution 'data'
  mix.info <- normalmixEM(data, k = NUM_COMPONENTS,
                          maxit = 100, epsilon = 0.01)
  summary(mix.info)

  numComponents <- length(mix.info$sigma)
  message("Extracted number of component distributions: ",
          numComponents)

  DISTRIB_COLORS <- 
    suppressWarnings(brewer.pal(NUM_COMPONENTS, "Set1"))

  # create (plot) histogram and ...
  g <- ggplot(as.data.frame(data), aes(x = data)) +
    geom_histogram(aes(y = ..density..),
                   binwidth = 0.01, alpha = 0.5) +
    theme(legend.position = 'top', legend.direction = 'horizontal')

  comp.labels <- lapply(seq(numComponents),
                        function (i) paste("Component", i))

  # ... fitted densities of components
  distComps <- lapply(seq(numComponents), function (i)
    stat_function(fun = calc.comp.fun,
                  args = list(mix = mix.info, comp.number = i),
                  size = 2, color = DISTRIB_COLORS[i]))

  legend <- list(scale_colour_manual(name = "Legend:",
                                     values = DISTRIB_COLORS,
                                     labels = unlist(comp.labels)))

  return (g + distComps + legend)
}

overlayPlot <- overlayHistDensity(log10(myData), 'calc.components')
print(overlayPlot)

Result:

enter image description here

Aleksandr Blekh
  • 2,462
  • 4
  • 32
  • 64