
For those without a statistics background who could still help with the coding: I have multiple small datasets (around 30 to 200 rows each) and need to determine the best-fitting distribution. One approach is visual inspection; for simplicity, I focus here on the Normal and Beta distributions.

The Beta distribution has a constraint the other distributions don't: the data must lie within [0, 1]. If the data falls outside this range, we can rescale it to [0, 1] first, and that is no problem (a minimal rescaling sketch is shown below).
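
For concreteness, this is the kind of min-max rescaling I mean (a sketch with toy values; the small offsets are just illustrative):

x <- c(12.3, 15.8, 14.1, 13.7, 16.2)          # toy data outside [0, 1]
shift <- min(x)
scale <- diff(range(x))
# small offsets keep the endpoints strictly inside (0, 1),
# since dbeta() can be zero or infinite exactly at 0 and 1
u <- (x - shift + 1e-8) / (scale + 2e-8)
range(u)                                       # roughly 0 to 1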

JMP (a powerful statistical package) works fine regardless of how large the dataset is. I'm trying to replicate the same graphs in R. With small datasets (fewer than 80 rows) I get meaningless results in R, while larger datasets (around 5000 data points) produce consistent results in both R and JMP. Any ideas?

Please keep in mind that all my datasets are small, so I get meaningless graphs in R. I'm unable to share my data, so we'll work with generated data.

I have the following code:

library(MASS)      # fitdistr()
library(rlang)     # enquo(), as_label()
library(ggplot2)

get_histogram <- function(data_set, column_name, bin_width) {
  
  # look up the requested column by its (unquoted) name
  colname <- as_label(enquo(column_name))
  x <- data_set[[colname]]
  
  # rescale the data to (0, 1) so a Beta distribution can be fitted;
  # the small offsets keep the endpoints strictly inside the interval
  shift <- min(x)
  scale <- diff(range(x))
  norm_it <- function(v) (v - shift + 1e-8) / (scale + 2e-8)
  
  # maximum-likelihood fit of the two-parameter Beta to the rescaled data
  beta_par <- fitdistr(norm_it(x), dbeta,
                       start = list(shape1 = 1, shape2 = 1))$estimate
  
  ggplot(data_set, aes(x = {{ column_name }})) +
    geom_histogram(aes(y = after_stat(density)), binwidth = bin_width,
                   fill = "lightblue", colour = "black") +
    # Normal overlay on the original scale
    stat_function(fun = dnorm, args = list(mean = mean(x), sd = sd(x)),
                  mapping = aes(colour = "Normal")) +
    # Beta overlay: evaluate the fitted density on the rescaled axis and
    # divide by scale to map the density back to the original axis
    stat_function(fun = function(x) dbeta(norm_it(x), beta_par[1], beta_par[2]) / scale,
                  mapping = aes(colour = "Beta")) +
    scale_colour_manual("Distribution",
                        values = c("red", "blue"))
}

Above I've kept only the part of my code that overlays the two distributions (Beta and Normal). When I generate random data with fewer than 80 points, even if it's Beta-distributed, the graph doesn't make sense; this doesn't happen with more data points. Note that most of my real datasets have fewer than 80 rows, so the problem persists with real data.

set.seed(2122)
test_dt <- rnorm(50, 30, 2)   # 50 normally distributed values, well outside [0, 1]
new_col <- sample(size = 50, x = c("a", "b", "c"), replace = TRUE)
df <- data.frame(test_dt, new_col)
rm(test_dt)
rm(new_col)

get_histogram(df, test_dt, .05)

When I fit a beta distribution in JMP, I get a nice result, but R's output is completely incorrect.

[Screenshot: JMP's Beta fit for the 50-point dataset next to the R plot produced by the code above]

This doesn't occur with more data, like 5000 points, where R and JMP outputs are very similar, as shown:

[Screenshot: JMP and R outputs for a ~5000-point dataset, which agree closely]

I have two questions:

  1. In stat_function(), should the denominator be / scale or / (scale + shift) when overlaying the Beta distribution? Some argue for the first option, even though the second seems mathematically correct to me (see the small check after this list).
  2. Why does this code struggle with small datasets? The usual rule of thumb is that any sample larger than about 30 is enough to fit a distribution, but that doesn't seem to be the case here. Any insights? I'd be grateful for any help with this R issue.
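
For question 1, here is a small numerical check one could run (a sketch; shape1 = 2, shape2 = 5 and the shift/scale values are arbitrary). If Y = (X - shift) / scale follows a Beta distribution, the change-of-variables formula gives the density of X, and only one of the two candidate denominators yields something that integrates to 1:

shape1 <- 2; shape2 <- 5
shift <- 30; scale <- 4

f_scale       <- function(x) dbeta((x - shift) / scale, shape1, shape2) / scale
f_scale_shift <- function(x) dbeta((x - shift) / scale, shape1, shape2) / (scale + shift)

integrate(f_scale,       lower = shift, upper = shift + scale)  # ~ 1
integrate(f_scale_shift, lower = shift, upper = shift + scale)  # ~ 0.12, not a proper density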

Compare the y-axes in these screenshots. The first one shows densities of up to 400, which makes no sense for this data. Identifying the cause of this could be the answer to all my questions.

Joe the Second
  • It would be much easier to debug this if you can generate a **minimal** example, i.e. break out just the Beta-fitting machinery and give us a specific example where it goes wrong ... – Ben Bolker Aug 23 '23 at 02:33
  • Thanks a lot for your feedback @BenBolker! I edited the question and retained only the two overlaid distributions. This way, when someone attempts to provide an answer, they can consider that scaling the entire dataset initially can affect the overlay of other distributions. If this is not yet considered a minimal example, please let me know, and I can edit it further. Thanks. – Joe the Second Aug 23 '23 at 16:30
  • The JMP version doesn't do any rescaling. If you take that out of the parameter estimates and the stat_function you get the expected curve, because your data is already between 0 and 1: https://i.stack.imgur.com/72svA.png. And notice how all your values are large (>0.75). When you rescale, you now have a very different range of numbers with a very different distribution spread all across 0-1. You are taking away the information that all your values are large values. – MrFlick Aug 24 '23 at 17:13
  • That's a very good point. Initially, I generated normal values, and when I edited my question, I used rbeta(). I totally forgot that it would be wrong to scale data that is already within the range (0,1). However, the problem remains since my real data is not beta distributed (it falls outside the [0,1] range) so I have to scale my data, and the graphs I obtain are meaningless. Please replace rbeta() with rnorm() or other distributions, and you will get similar results. – Joe the Second Aug 24 '23 at 17:32
  • The results of JMP show 4 parameters for the beta. This is not the usual beta distribution formulation; you are only fitting a two-parameter model. There's an existing question about fitting a four-parameter beta distribution: https://stackoverflow.com/questions/49416540/four-parameter-beta-distribution-in-r. These are really statistical issues, not programming problems. To understand the difference you probably want to ask the statisticians over at [stats.se] instead. – MrFlick Aug 24 '23 at 19:52
  • "my real data is not beta distributed" Why are you trying to shoehorn data that isn't beta-distributed into a beta distribution? You should focus on the actual distribution of the data. You can always scale the data and the distribution afterwards. – Roland Aug 25 '23 at 09:11
  • @Roland, sometimes when working with real-world data, you don't know the true underlying distributions. You can use statistical techniques such as AIC comparisons or Q-Q plots, but at the end of the day, visualization is a great way to observe the distribution that the data follows. So, in my case, when I say that my data is not Beta, it means that we cannot be certain it is not Beta; furthermore, it has a range larger than [0,1]. However, it might still turn out that Beta is the best-fitting distribution. – Joe the Second Aug 25 '23 at 15:47
  • No, it won’t. There is an infinite number of distributions and you can never be sure which is the true distribution, but you can be sure it’s not a beta distribution. Anyway, why are you even fitting a distribution model? There is a good chance that’s not even necessary. – Roland Aug 25 '23 at 17:14

3 Answers


Based on the comment from @MrFlick, the following code produces output similar to JMP's, although without knowing exactly which distribution JMP uses it is of course impossible to get identical results.

The solution uses the 4-parameter generalized beta distribution, which could be the same one JMP uses. The function eBeta_ab(...) estimates the parameters; it comes, together with dBeta_ab(), from the ExtDist package. The estimated parameters are close to the ones in JMP. Since the generalized beta does not have the beta's strict [0, 1] boundaries, we do not need to shift or scale anything.
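
For intuition: the 4-parameter beta on an interval [a, b] is just the standard 2-parameter beta rescaled to that interval, so its density can also be written via dbeta() (a quick sketch, assuming ExtDist's shape1/shape2/a/b parameterization):

library(ExtDist)

# density of a beta distribution supported on [a, b]
dbeta4 <- function(x, shape1, shape2, a, b) {
  dbeta((x - a) / (b - a), shape1, shape2) / (b - a)
}

# should agree with ExtDist's implementation up to floating-point error
x <- seq(26, 34, by = 0.5)
all.equal(dbeta4(x, 2, 3, 25, 35),
          dBeta_ab(x, shape1 = 2, shape2 = 3, a = 25, b = 35))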

library(fitdistrplus)
library(ExtDist)   # eBeta_ab(), dBeta_ab()
library(ggplot2)
library(rlang)     # enquo(), as_label()

get_histogram <- function(data_set, column_name, bin_width) {
  
  colname <- as_label(enquo(column_name))
  x <- data_set[[colname]]
  
  # fit the 4-parameter (generalized) beta directly on the original scale
  beta_par <- eBeta_ab(x)
  print(beta_par)
  
  ggplot(data_set, aes(x = {{ column_name }})) +
    geom_histogram(aes(y = after_stat(density)), binwidth = bin_width,
                   fill = "lightblue", colour = "black") +
    stat_function(fun = dnorm, args = list(mean = mean(x), sd = sd(x)),
                  mapping = aes(colour = "Normal")) +
    # overlay the fitted generalized beta, no rescaling needed
    stat_function(fun = function(x) dBeta_ab(x,
                                             beta_par$shape1,
                                             beta_par$shape2,
                                             beta_par$a,
                                             beta_par$b),
                  mapping = aes(colour = "Generalized Beta")) +
    scale_colour_manual("Distribution",
                        values = c("red", "blue"))
}

set.seed(2122)
test_dt <- rnorm(50, 30, 2)
new_col <- sample(size = 50, x = c("a", "b", "c"), replace = TRUE)
df <- data.frame(test_dt, new_col)

get_histogram(df, test_dt, .5)

[Output from the code above: the printed eBeta_ab parameter estimates and the histogram of test_dt with Normal and generalized Beta density overlays]

pookpash

Mainly to answer your Q2:

I don't think there is any problem with the plotting as such, since densities can of course be greater than 1, e.g. dnorm(0, 0, 0.1) is about 4.

But you do have problems with the numerical optimisation from fitdistr. If I look at your estimates for beta_par, I get

beta_par
shape1    shape2 
0.6743766 0.5413656 

These estimates, with both shape parameters below 1, make the fitted beta density diverge at the endpoints of the rescaled interval, which explains the apparent blow-up on the right-hand side of your plot. It also explains why things work well when you have more data: the optimisation then finds a sensible solution easily.

So what can you do? There are other packages for fitting, e.g. fitdistrplus, as I read about here. I tried it quickly but didn't get any improvement - you may need more advanced options, e.g. increasing the number of iterations, reducing tolerances, or trying another optimisation method (a sketch of what that could look like is below). The package has a nice vignette about choosing an optimisation algorithm, including a beta distribution example.
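
In case it helps, trying those options might look roughly like this (a sketch only; the moment-based starting values and the L-BFGS-B settings are my own illustrative choices, not something I have verified fixes the small-sample fit):

library(fitdistrplus)

set.seed(2122)
x <- rnorm(50, 30, 2)                                 # a small sample like the OP's
u <- (x - min(x) + 1e-8) / (diff(range(x)) + 2e-8)    # rescale to (0, 1) as in norm_it()

# method-of-moments starting values often help the beta fit converge
m <- mean(u); v <- var(u)
start <- list(shape1 = m * (m * (1 - m) / v - 1),
              shape2 = (1 - m) * (m * (1 - m) / v - 1))

fit <- fitdist(u, "beta",
               method       = "mle",
               start        = start,
               optim.method = "L-BFGS-B",          # bounded optimiser
               lower        = c(1e-3, 1e-3),       # keep both shapes positive
               control      = list(maxit = 1000))  # allow more iterations
summary(fit)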

nstjhp

In the plots you shared that were generated from JMP, the beta distribution uses 4 parameters, whereas in the R code you shared the beta distribution uses only 2 parameters. If both used the same number of parameters with the same values, you would get identical charts.

user2813606