
I am trying to fit a Gaussian curve to this scatter data:

library(tidyverse)
MK20 <- tribble(~X.Intensity,    ~Average,
             0.400,  0.0000000,
             0.463,  0.0000000,
             0.536,  0.0000000,
             0.621,  0.0000000,
             0.719,  0.0000000,
             0.833,  0.0000000,
             0.965,  0.0000000,
             1.120,  0.0000000,
             1.290,  0.0000000,
             1.500,  0.0000000,
             1.740,  0.0000000,
             2.010,  0.0000000,
             2.330,  0.0000000,
             2.700,  0.0000000,
             3.120,  0.0000000,
             3.620,  0.0000000,
             4.190,  0.0000000,
             4.850,  0.0000000,
             5.610,  0.0000000,
             6.500,  0.0000000,
             7.530,  0.0000000,
             8.720,  0.0000000,
             10.100,  0.0000000,
             11.700,  0.0000000,
             13.500,  0.0000000,
             15.700,  0.0000000,
             18.200,  0.0000000,
             21.000,  0.0000000,
             24.400,  0.0000000,
             28.200,  0.0000000,
             32.700,  0.0000000,
             37.800,  0.0000000,
             43.800,  0.7023333,
             50.700,  3.3700000,
             58.800,  7.3933333,
             68.100, 11.4666667,
             78.800, 14.3666667,
             91.300, 15.4000000,
             106.000, 14.5000000,
             122.000, 12.0000000,
             142.000,  8.6433333,
             164.000,  5.2200000,
             190.000,  2.4500000,
             220.000,  0.7580000,
             255.000,  0.1306667,
             295.000,  0.0000000,
             342.000,  0.0000000,
             396.000,  0.0000000,
             459.000,  0.0000000,
             531.000,  0.0000000,
             615.000,  0.0000000,
             712.000,  0.0000000,
             825.000,  0.0000000,
             955.000,  0.0000000,
             1110.000,  0.0000000,
             1280.000,  0.0000000,
             1480.000,  0.0000000,
             1720.000,  0.0000000,
             1990.000,  0.0000000,
             2300.000,  0.0000000,
             2670.000,  0.0000000,
             3090.000,  0.0000000,
             3580.000,  0.0000000,
             4150.000,  0.0000000,
             4800.000,  0.0000000,
             5560.000,  0.0000000,
             6440.000,  0.0000000,
             7460.000,  0.0000000,
             8630.000,  0.0000000)

The code I'm using to plot is:

plot(log10(MK20$X.Intensity), MK20$Average, col=1, xlim=c(-0.5,4), 
    ylim=c(0,20), xlab="Log(Average diameter)", ylab="Intensity", xaxt='n')

I'm using the minor.tick.axis function to add minor ticks on the logarithmic x-axis. I want to add the best-fitting Gaussian curve to this data. I tried adding type='l' to the plot, but the curve wasn't smooth, and I don't want a curve that necessarily has to touch every data point, just one that fits best.

desertnaut
Manps
  • This seems to be a duplicate of the linked target question. I recommend taking a look at the answers to the dupe target question. Neither answer posted below shows how to *fit* a normal distribution to data. – Maurits Evers Dec 10 '19 at 11:25

2 Answers


A curve that touches each point will fit your data best for sure. :)

That aside, you could try to include a smoothed curve, e.g.

plot(log10(MK20$X.Intensity), MK20$Average, col=1, xlim=c(-0.5,4), ylim=c(0,20), 
     xlab="Log(Average diameter)", ylab="Intensity", xaxt='n')
lines(lowess(log10(MK20$X.Intensity), MK20$Average, f=0.3))

(Note that base R's lowess takes x and y arguments rather than a formula.)

You can vary the f parameter between 0 and 1 to change the level of smoothing; it is the proportion of points that influence each local fit.
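As a sketch of how the choice of f matters (assuming the MK20 tibble from the question is already defined; the colours and legend placement are arbitrary), you could overlay a few values:

```r
# overlay lowess curves for several smoothing spans
x <- log10(MK20$X.Intensity)
y <- MK20$Average

plot(x, y, xlim = c(-0.5, 4), ylim = c(0, 20),
     xlab = "Log(Average diameter)", ylab = "Intensity")
fs <- c(0.1, 0.3, 0.7)
for (i in seq_along(fs)) {
  lines(lowess(x, y, f = fs[i]), col = i + 1)
}
legend("topright", legend = paste("f =", fs),
       col = seq_along(fs) + 1, lty = 1)
```

Smaller f follows the points more closely; larger f smooths more aggressively.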

Here's what the output looks like with f=0.3.

[plot: data points with the lowess curve, f = 0.3]

Otto Kässi

We can't use the usual fitdistr approach to fit a normal distribution in this case because we don't have the original data. It looks like the 'Average' column is some type of density estimate. If it were a pdf it should integrate to 1, but it doesn't:

f <- approxfun(x = log10(MK20$X.Intensity), y= MK20$Average)
integrate(f, lower = log10(0.4), upper = log10(8630))

#6.142134 with absolute error < 0.00043

So we could turn this into a pdf by scaling it down by about 6.14, and then try to find the mean and standard deviation that match that pdf.
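As a quick sanity check, dividing by the value of that integral should give a curve that integrates to approximately 1 (this reuses the approxfun interpolation f defined above; the result is ~1 by construction, up to numerical error):

```r
# total area under the interpolated density
total <- integrate(f, lower = log10(0.4), upper = log10(8630))$value

# rescale so the curve behaves like a pdf
f_scaled <- approxfun(x = log10(MK20$X.Intensity),
                      y = MK20$Average / total)
integrate(f_scaled, lower = log10(0.4), upper = log10(8630))
# ~1 by construction
```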

Here is a first attempt at a simple Gaussian fit by eye. I chose a mean of 2 (by looking at where the density was greatest), a scaling factor of k = 6.14 (the value of the integral), and then adjusted the sd until there was a reasonable fit.

m <- 2     # mean, read off the peak location
s <- 0.15  # sd, adjusted by eye
k <- 6.14  # scaling factor (value of the integral)
x_seq <- seq(1, 3, length.out = 100)
df <- tibble(x_seq = x_seq, dens = dnorm(x_seq, m, s))


MK20 %>% 
  mutate(log_intensity = log10(X.Intensity)) %>% 
  ggplot(aes(log_intensity, Average/k)) +
  geom_point() +
  geom_line(data = df, aes(x_seq, dens)) 

[plot: rescaled data with the by-eye Gaussian]

Next I used optimx to fit the 3 parameters (k = scaling factor, m = mean, s = standard deviation) by minimising the sum of squares between the fit and the data.

Objective function (sum of squares of differences between fit and data)

obj <- function(x) {  # named obj so we don't overwrite the approxfun f defined above
  k <- x[1]
  m <- x[2]
  s <- x[3]
  MK20 %>% 
    mutate(log_intensity = log10(X.Intensity)) %>%
    mutate(fit = dnorm(log_intensity, m, s)) %>% 
    summarise(sum((fit - Average/k)^2)) %>% 
    pull()
}

Use optimx to find the parameters that minimise the sum of squares. The initial values are taken from the fit by eye.

library(optimx)
optimx(par = c(6.14, 2, 0.15), fn = obj)

# k = 6.294696, m = 1.971488, s = 0.1583936
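Rather than copying the numbers by hand, the fitted values can also be pulled out of a stored optimx result. If I recall correctly, optimx returns a data frame with one row per optimisation method and parameter columns p1, p2, p3 (in the order supplied), and coef() extracts those as a matrix; here res is assumed to hold the result of the optimx call above:

```r
# assuming res holds the result of the optimx() call above
p <- coef(res)[1, ]   # coef() returns a parameter matrix, one row per method
k <- p[[1]]           # scaling factor
m <- p[[2]]           # mean
s <- p[[3]]           # standard deviation
```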

Let's replot with the fitted parameters. Note that k, m and s still hold the by-eye values at this point, so they need to be updated first:

# fitted parameters from optimx
k <- 6.294696
m <- 1.971488
s <- 0.1583936

# points for a gaussian
x_seq <- seq(1, 3, length.out = 100)
df <- tibble(x_seq = x_seq, dens = dnorm(x_seq, m, s))

MK20 %>% 
  mutate(log_intensity = log10(X.Intensity)) %>% 
  ggplot(aes(log_intensity, Average/k)) +
  geom_point() +
  geom_line(data = df, aes(x_seq, dens)) 

[plot: rescaled data with the fitted Gaussian]

Tony Ladson
  • This is not a fit. You are plotting a normal distribution for a chosen set of parameters (mean and variance). A fit is an inference procedure to obtain estimates for those parameters conditioned on the data. Can you revise your answer to show an actual fit? – Maurits Evers Dec 10 '19 at 11:29
  • @MauritsEvers my parameters are conditioned on the data - I manually chose the parameters to get a fit by eye. However you are right, it is possible and appropriate to do better. Have now added a numerical fitting procedure. – Tony Ladson Dec 10 '19 at 20:20
  • Manually choosing parameters to *"get a fit by eye"* is not how the statistics community usually interprets the question how to fit a model to data. In fact, this can be very dangerous, as the eye can be a very poor judge (see e.g. [Anscombe's quartet](https://en.wikipedia.org/wiki/Anscombe%27s_quartet)). Don't get this the wrong way but this is not a good answer. First off, it demonstrates poor statistical practice. Furthermore in your edit, you don't give any explanations (parameters, code, algorithm, etc.) demonstrating the use of `optimx`. [...] – Maurits Evers Dec 10 '19 at 21:11
  • [continued] The canonical way to fit a normal distribution to data is to use `MASS:: fitdistr` or `fitdistrplus::fitdist` in which case an answer to OPs question becomes a one-liner (see the dupe target). Stack Overflow aims to build a comprehensive catalog of *good* questions & answers, that will be of use to people having similar issues. For the reasons given above, I don't think your "answer" ticks those boxes. – Maurits Evers Dec 10 '19 at 21:14
  • @MauritsEvers perhaps I'm misinterpreting the question but I don't think the `fitdistr` approach will work here because we don't have the original data. We have some type of density estimate. Anyway, I've edited my answer to explain where I'm coming from. I've also followed your advice to explain what optimx is doing. – Tony Ladson Dec 12 '19 at 21:59
  • I agree with you that `fitdistr` will not work here. I've removed my downvote. I still think this is not a very good answer (aside from the fact that it is a duplicate question). It uses non-standard terminology (a more canonical and conventional statistical approach would define a likelihood function `dnorm(x, mean, sd)` and then use `mle` to obtain MLE estimates for `mean` and `sd`, instead of essentially re-inventing the wheel and *"minimising the sum of squares between the fit and the data"*). My biggest issue is still with the whole "by-eye" business. Teaches very bad practices. – Maurits Evers Dec 16 '19 at 03:31