0

I'm trying to superimpose a mixed distribution plot with a plot of identified component distributions, using ggplot2 package and a user-defined function for its stat_function(). I have tried two approaches. The distribution identification is normal in both cases:

number of iterations= 11 
summary of normalmixEM object:
         comp 1  comp 2
lambda 0.348900 0.65110
mu     2.019878 4.27454
sigma  0.237472 0.43542
loglik at estimate:  -276.3643 

A) However, in the first approach, the output contains the following error:

Error in eval(expr, envir, enclos) : object 'comp.number' not found

The reproducible example for this approach follows (faithful is a built-in R dataset):

library(ggplot2)
library(mixtools)

DISTRIB_COLORS <- c("green", "red")
NUM_COMPONENTS <- 2

set.seed(12345)

mix.info <- normalmixEM(faithful$eruptions, k = NUM_COMPONENTS,
                        maxit = 100, epsilon = 0.01)
summary(mix.info)

plot.components <- function(mix, comp.number) {
  g <- stat_function(fun = function(mix, comp.number) 
  {mix$lambda[comp.number] *
     dnorm(x, mean = mix$mu[comp.number],
           sd = mix$sigma[comp.number])}, 
  geom = "line", aes(colour = DISTRIB_COLORS[comp.number]))
  return (g)
}

g <- ggplot(faithful, aes(x = waiting)) +
  geom_histogram(binwidth = 0.5)

distComps <- lapply(seq(NUM_COMPONENTS),
                    function(i) plot.components(mix.info, i))
print(g + distComps)

B) The second approach doesn't produce any errors. However, the only plot visible is the one of the mixed distribution. Plots of its component distributions are not produced or visible (with some degree of confidence it seems to me that the a straight horizontal line y=0 is also visible, but I'm not 100% sure):

enter image description here

The following is a reproducible example for this approach:

library(ggplot2)
library(mixtools)

DISTRIB_COLORS <- c("green", "red")
NUM_COMPONENTS <- 2

set.seed(12345)

mix.info <- normalmixEM(faithful$eruptions, k = NUM_COMPONENTS,
                        maxit = 100, epsilon = 0.01)
summary(mix.info)

plot.components <- function(x, mix, comp.number, ...) {
  mix$lambda[comp.number] *
    dnorm(x, mean = mix$mu[comp.number],
          sd = mix$sigma[comp.number], ...)
}

g <- ggplot(faithful, aes(x = waiting)) +
  geom_histogram(binwidth = 0.5)

distComps <- lapply(seq(NUM_COMPONENTS), function(i)
  stat_function(fun = plot.components,
                args = list(mix = mix.info, comp.number = i)))
print(g + distComps)

Question: What are the problems in each of the approaches and which one is (more) correct?

UPDATE: Just minutes after posting, I realized that I forgot to include the line-drawing part of the stat_function() for the second approach, so that the corresponding lines are as follow:

distComps <- lapply(seq(NUM_COMPONENTS), function(i)
  stat_function(fun = plot.components,
                args = list(mix = mix.info, comp.number = i)),
  geom = "line", aes(colour = DISTRIB_COLORS[i]))

However, this update produces an error, source of which I don't quite understand:

Error in FUN(1:2[[1L]], ...) : 
  unused arguments (geom = "line", list(colour = DISTRIB_COLORS[i]))
Aleksandr Blekh
  • 2,462
  • 4
  • 32
  • 64
  • 3
    You're in a real mess here. Your normalmixEM function is being called on `$eruptions`, so its looking at the distribution of that variable, but your plots are based on `x=waiting` which is some completely different variable. Look at the summary output means and variances, they are nowhere near your X-axis values. You're probably seeing the tail of distributions centred at 2.019 and 4.275. Fix all that, then we'll work on the various scoping problems and the fact that fun should be a function of x only... – Spacedman Aug 29 '14 at 14:35
  • @Spacedman: Thank you! Already started working on this. – Aleksandr Blekh Aug 29 '14 at 14:37
  • @Spacedman: I fixed the wrong variable issue (changed to `$waiting` for both approaches) and see improvement in components identification. But the error messages remain the same. Still trying to figure out the scaling/scope issue. – Aleksandr Blekh Aug 29 '14 at 14:43
  • Fixed error in Approach 2 by allowing extra parameters (`...`). After reading this info on StackOverflow (http://stackoverflow.com/a/25091231/2872891) and linked comments by Hadley, I understand that all calculations should be made externally to `stat_function()` and other `ggplot2` functions due to environment scoping. This partially matches my Approach 2, so I'm concentrating on fixing it by forming a supplementary data frame with calculation results and passing it then to `geom_line()`. – Aleksandr Blekh Aug 29 '14 at 15:34

1 Answers1

3

Finally I have figured out how to do what I wanted and reworked my solution. I've adapted parts of answers by @Spacedman and @jlhoward for this question (which I haven't seen at the time of posting my question): Any suggestions for how I can plot mixEM type data using ggplot2. However, my solution is a little different. On one hand, I've used @Spacedman's approach of using stat_function() - the same idea I've tried to use in my original version - I like it better than the alternative, which seems a bit too complex (while more flexible). On the other hand, similarly to @jlhoward's approach, I've simplified parameter passing. I've also introduced some visual improvements, such as automatic selection of differentiated colors for the easier component distributions identification. For my EDA, I've refactored this code as an R module. However, there is still one issue, which I'm still trying to figure out: why the component distribution plots are located below the expected density plots, as shown below. Any advice on this issue will be much appreciated!

UPDATE: Finally, I've figured out the issue with scaling and updated the code and the figure accordingly - the y values need to be multiplied by the value of binwidth (in this case, it's 0.5) to account for the number of observations per bin.

enter image description here

Here's the complete reworked reproducible solution:

library(ggplot2)
library(RColorBrewer)
library(mixtools)

NUM_COMPONENTS <- 2

set.seed(12345) # for reproducibility

data <- faithful$waiting # use R built-in data

# extract 'k' components from mixed distribution 'data'
mix.info <- normalmixEM(data, k = NUM_COMPONENTS,
                        maxit = 100, epsilon = 0.01)
summary(mix.info)

numComponents <- length(mix.info$sigma)
message("Extracted number of component distributions: ",
        numComponents)

calc.components <- function(x, mix, comp.number) {
  mix$lambda[comp.number] *
    dnorm(x, mean = mix$mu[comp.number], sd = mix$sigma[comp.number])
}

g <- ggplot(data.frame(x = data)) +
  geom_histogram(aes(x = data, y = 0.5 * ..density..),
                 fill = "white", color = "black", binwidth = 0.5)

# we could select needed number of colors randomly:
#DISTRIB_COLORS <- sample(colors(), numComponents)

# or, better, use a palette with more color differentiation:
DISTRIB_COLORS <- brewer.pal(numComponents, "Set1")

distComps <- lapply(seq(numComponents), function(i)
  stat_function(fun = calc.components,
                arg = list(mix = mix.info, comp.number = i),
                geom = "line", # use alpha=.5 for "polygon"
                size = 2,
                color = DISTRIB_COLORS[i]))
print(g + distComps)
Community
  • 1
  • 1
Aleksandr Blekh
  • 2,462
  • 4
  • 32
  • 64