0

Users of a Shiny app can test data sets for Poisson, normality, and exponentiality. I am returning the results of the statistical test they chose. In addition, I thought it would be nice to plot the density from the data along with the theoretical distribution. They could be testing multiple sets of data at once, so I am faceting the plot.

From ggplot add Normal Distribution while using `facet_wrap` I found the really great ggh4x package. However, since this could be industry data, there may be a minimum that is not zero.

The problem is that theodensity(distri="exp") uses dexp which doesn't account for a minimum number, so the theoretical distribution plot doesn't match the data.

How can I tell the stat_theodensity that there is an xmin for each facet, which is the min of the data in the facet? I see that fitdistrplus can use different methods to fit an exponential curve, and that, for example, method="mse" would work. Is there a way to pass this through stat_theodensity?

library(ggh4x)

#generate 2 exponential distributions with xmin > 0
data1 <- rexp(n = 500,rate = 1/100)+100
data2 <- rexp(n = 500,rate = 1/250)+500
data <- c(data1,data2)

#generate a code for facets
ID1 <- c(rep("Set 1",times=500))
ID2 <- c(rep("Set 2",times=500))
ID <- c(ID1,ID2)

#make the data for plotting
plot_dat <- data.frame(ID,data)

#make the graph
p <- ggplot(data = plot_dat, aes(x=data))+
  geom_density()+
  stat_theodensity(distri = "exp")+
  facet_wrap(facets = ~ID,scales = "free")
p

#what the first point of the graphs should be
dexp(x = 100-100,rate = 1/100)
#[1] 0.01
dexp(x = 500-500,rate = 1/250)
#[1] 0.004

********EDIT

OK I am getting closer. The following code works, but only for the second pass through the loop. If I change the numbers around for data1 and data2, it is always only the second one that plots the theoretical distribution.

I did ggplot_build after the first loop and it gives an error in fitdist(), which is code 100. I don't know why it would always fail on the first one but not on the second one, even with the same data.

Any ideas?

#generate 2 exponential distributions with xmin > 0
data1 <- rexp(n = 500,rate = 1/250)+500
data2 <- rexp(n = 500,rate = 1/100)+250
data <- c(data1,data2)

#generate a code for facets
ID1 <- c(rep("Set 1",times=500))
ID2 <- c(rep("Set 2",times=500))
ID <- c(ID1,ID2)

#make the data for plotting
plot_dat <- data.frame(ID,data)

#make the graph
p <- ggplot(data = plot_dat, aes(x=data))+
  geom_density(color="red")

#loop through sets and add facets
for (set in unique(plot_dat$ID)){
  xmin <- min(plot_dat$data[ID == set])
  p<-p+
    stat_theodensity(
      data = ~subset(.x, ID == set),
      aes(x = stage(data - xmin, after_stat = x + xmin)),
      distri = "exp"
    )
}

  #stat_theodensity(distri = "exp")+
p<-p+
  facet_wrap(facets = ~ID,scales = "free")
p
  • It wasn't at all difficult to find SO Q&A demonstrating how to use `itdist` for shifted exponential. You should do some searching and then edit you question to reflect your new understanding of the problem. – IRTFM Sep 26 '22 at 20:46
  • @IRTFM, What an appropriate user name, and how unfortunate your typo sent me looking for the package ```itdist```, lol! In my defense, I did a lot of searching, though not with fitdistrplus. I'll edit the question to ask how to use a different method with ```fitdistrplus``` within ```ggh4x``` – Steven Ouellette Sep 26 '22 at 21:34
  • Perhaps I was both naive _and_ careless. I assumed that after you wrote the d-,p-,q-, functions for the shifted exponential that it would "just work" with `stat_theodist`. The link I found was https://stackoverflow.com/questions/64366982/problem-in-finding-starting-values-for-shifted-exponential-distribution (I'm not sure whether you were hoping to estimate the minimum of the shifted distribution. If so it might require some work outside of the plotting universe before sending to `stat_theodensity`.) – IRTFM Sep 26 '22 at 23:42

1 Answers1

2

I don't know about the statistics of your problem, but if the issue is subtracting a number before calculating the density and afterwards adding it, you might do that with stage(). I couldn't find a more elegant way than hardcoding these values for each set separately, but I'd be happy to hear about more creative solutions.

library(ggh4x)
#> Loading required package: ggplot2

#generate 2 exponential distributions with xmin > 0
data1 <- rexp(n = 500,rate = 1/100)+100
data2 <- rexp(n = 500,rate = 1/250)+500
data <- c(data1,data2)

#generate a code for facets
ID1 <- c(rep("Set 1",times=500))
ID2 <- c(rep("Set 2",times=500))
ID <- c(ID1,ID2)

#make the data for plotting
plot_dat <- data.frame(ID,data)

#make the graph
ggplot(data = plot_dat, aes(x=data))+
  geom_density() +
  stat_theodensity(
    data = ~ subset(.x, ID == "Set 1"),
    aes(x = stage(data - 100, after_stat = x + 100)),
    distri = "exp"
  ) +
  stat_theodensity(
    data = ~ subset(.x, ID == "Set 2"),
    aes(x = stage(data - 500, after_stat = x + 500)),
    distri = "exp"
  ) +
  facet_wrap(facets = ~ID,scales = "free")

Created on 2022-09-26 by the reprex package (v2.0.1)

EDIT

I think OP's update had a problem with non-standard evaluation. It should work when you use a lapply() loop instead of a for-loop because then xmin is not a global variable that might be mistakingly looked up.

library(ggh4x)
#> Loading required package: ggplot2
library(ggplot2)

#generate 2 exponential distributions with xmin > 0
data1 <- rexp(n = 500,rate = 1/250)+500
data2 <- rexp(n = 500,rate = 1/100)+250
data <- c(data1,data2)

#generate a code for facets
ID1 <- c(rep("Set 1",times=500))
ID2 <- c(rep("Set 2",times=500))
ID <- c(ID1,ID2)

#make the data for plotting
plot_dat <- data.frame(ID,data)

#make the graph
p <- ggplot(data = plot_dat, aes(x=data))+
  geom_density(color="red") +
  facet_wrap(facets = ~ ID, scales = "free")

#loop through sets and add facets
p + lapply(unique(plot_dat$ID), function(i) {
  xmin <- min(plot_dat$data[plot_dat$ID == i])
  stat_theodensity(
    data = ~ subset(.x, ID == i),
    aes(x = stage(data - xmin, after_stat = x + xmin)),
    distri = "exp"
  )
})

Created on 2022-09-27 by the reprex package (v2.0.1)

teunbrand
  • 33,645
  • 4
  • 37
  • 63
  • Hi @teunbrand, and thanks for the extremely helpful package! I didn't know about ```stage()```. Hmm, in this case though we don't a priori know the minimum, and it might be different for each facet. Is there a way to use data-(min(data)) kind of thing? – Steven Ouellette Sep 26 '22 at 21:11
  • You could create a family of functions `dshiftexp`, `rshiftexp` and `qshiftexp` that are identical to the `exp` equivalents except they take an extra `shift` parameter, and pass `"shiftexp"` to the stat. I think it might be problematic to find the shift parameter though, since the distribution density is 0 below the shift threshold. – Allan Cameron Sep 26 '22 at 21:33
  • @AllanCameron, IRTFM in a comment above pointed out that `fitdistrplus` can fit an exponential with a minimum if I can have it use a different method. By convention in my world, we use the observed minimum for the shift parameter, so in principle that part is easy. It is getting it for each facet that I can't figure. – Steven Ouellette Sep 26 '22 at 21:40
  • 1
    @StevenOuellette You would need a separate layer for each facet, filtered by each faceting variable, as shown by Teunbrand. You can't change the method in `fitdistrplus` within `stat_theodensity` as far as I can tell (though @teunbrand wrote the function so will know better than I) – Allan Cameron Sep 26 '22 at 22:02
  • No I don't think the estimation method is exposed, but as long as {fitdistrplus} can find the distribution functions (and decent starting values are provided) it should be doable to use exotic distributions. – teunbrand Sep 27 '22 at 08:04
  • Please see edit on the original post. – Steven Ouellette Sep 27 '22 at 16:43
  • 1
    I think I solved the issue you reported in edit – teunbrand Sep 27 '22 at 19:08
  • @teunbrand Ahh, brilliant. That's fixed it. – Steven Ouellette Sep 27 '22 at 20:42
  • @teunbrand It does have one downside - it seems to always include 0 in the x-axis no matter the data. – Steven Ouellette Sep 27 '22 at 21:45
  • Yeah I noticed that too. Other than manually setting the x-axis limits, I wouldn't know a convenient way around this. – teunbrand Sep 27 '22 at 22:01