
I have list data like the one below. I want to perform nonlinear regression Gaussian curve fitting between mids and counts for each element of my list and report the mean and standard deviation.

mylist<- structure(list(A = structure(list(breaks = c(-10, -9, 
-8, -7, -6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4), counts = c(1L, 
0L, 1L, 5L, 9L, 38L, 56L, 105L, 529L, 2858L, 17L, 2L, 0L, 2L), 
    density = c(0.000276014352746343, 0, 0.000276014352746343, 
    0.00138007176373171, 0.00248412917471709, 0.010488545404361, 
    0.0154568037537952, 0.028981507038366, 0.146011592602815, 
    0.788849020149048, 0.00469224399668783, 0.000552028705492686, 
    0, 0.000552028705492686), mids = c(-9.5, -8.5, -7.5, -6.5, 
    -5.5, -4.5, -3.5, -2.5, -1.5, -0.5, 0.5, 1.5, 2.5, 3.5), 
    xname = "x", equidist = TRUE), .Names = c("breaks", "counts", 
"density", "mids", "xname", "equidist"), class = "histogram"), 
    B = structure(list(breaks = c(-7, -6, -5, 
    -4, -3, -2, -1, 0), counts = c(2L, 0L, 6L, 2L, 2L, 1L, 3L
    ), density = c(0.125, 0, 0.375, 0.125, 0.125, 0.0625, 0.1875
    ), mids = c(-6.5, -5.5, -4.5, -3.5, -2.5, -1.5, -0.5), xname = "x", 
        equidist = TRUE), .Names = c("breaks", "counts", "density", 
    "mids", "xname", "equidist"), class = "histogram"), C = structure(list(
        breaks = c(-7, -6, -5, -4, -3, -2, -1, 0, 1), counts = c(2L, 
        2L, 4L, 5L, 14L, 22L, 110L, 3L), density = c(0.0123456790123457, 
        0.0123456790123457, 0.0246913580246914, 0.0308641975308642, 
        0.0864197530864197, 0.135802469135802, 0.679012345679012, 
        0.0185185185185185), mids = c(-6.5, -5.5, -4.5, -3.5, 
        -2.5, -1.5, -0.5, 0.5), xname = "x", equidist = TRUE), .Names = c("breaks", 
    "counts", "density", "mids", "xname", "equidist"), class = "histogram")), .Names = c("A", 
"B", "C"))

I have read Fitting a density curve to a histogram in R, but that is about fitting a curve to a histogram. What I want are the "Best-fit values":

"Mean" and "SD"

If I use PRISM to do it, I should get the following results for A

Mids   Counts
-9.5    1
-8.5    0
-7.5    1
-6.5    5
-5.5    9
-4.5    38
-3.5    56
-2.5    105
-1.5    529
-0.5    2858
0.5     17
1.5     2
2.5     0
3.5     2

Performing nonlinear regression Gaussian curve fitting, I get

"Best-fit values"   
"     Amplitude"    3537
"     Mean"       -0.751
"     SD"         0.3842

For the second set, B:

Mids   Counts
-6.5    2
-5.5    0
-4.5    6
-3.5    2
-2.5    2
-1.5    1
-0.5    3

I get
"Best-fit values"   
"     Amplitude"    7.672
"     Mean"         -4.2
"     SD"          0.4275

and for the third one

Mids   Counts
-6.5    2
-5.5    2
-4.5    4
-3.5    5
-2.5    14
-1.5    22
-0.5    110
0.5      3

I get this

"Best-fit values"   
"     Amplitude"    120.7
"     Mean"       -0.6893
"     SD"        0.4397
  • If you are looking for the estimated mean and standard deviation / variance, I think this can be accomplished through a maximum likelihood procedure. There is the `mle` function in base R as well as the `maxLik` package (a sketch follows these comments). In this instance, you should use the raw data rather than mids and counts. The first example in `mle` should be an analogue to what you want. – lmo Jun 28 '16 at 15:09
  • I can't watch videos at the moment but will give it a look in a couple of hours when I am able. It seems that estimating from binned data loses out on useful information. This is especially concerning given that you have such a small sample size: 16, I think. – lmo Jun 28 '16 at 15:17
  • @lmo OK great. Not really, the sample size is much, much higher, like 1000, so that would not be a problem in that case, I think. – nik Jun 28 '16 at 15:37
  • Do you have access to original data? If so, the above functions would probably be the way to go. I'll take a look at the video when I get a chance. – lmo Jun 28 '16 at 15:38
  • @lmo Yes I do, ok I'll wait for you – nik Jun 28 '16 at 16:36
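For reference, here is a minimal sketch of the maximum-likelihood route lmo describes, assuming access to the raw (unbinned) data; the x vector below is made-up stand-in data, not the asker's:

library(stats4)  # mle() ships with base R in the stats4 package

# hypothetical raw data standing in for the real unbinned values
x <- rnorm(1000, mean = -0.75, sd = 0.4)

# negative log-likelihood of a normal model
nll <- function(mu, sigma) -sum(dnorm(x, mu, sigma, log = TRUE))

# maximum-likelihood fit, started at the sample moments;
# bounding sigma above 0 avoids NaN warnings during the search
fit <- mle(nll, start = list(mu = mean(x), sigma = sd(x)),
           method = "L-BFGS-B", lower = c(-Inf, 1e-6))
coef(fit)  # estimated mean and sd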

1 Answer


To convert the histogram back into an estimate of the mean and standard deviation, first expand each bin midpoint by its bin count. This reconstructs an approximation of the original data.

Based on your example above:

#extract the mid points and create a list of simulated data
simdata <- lapply(mylist, function(x) rep(x$mids, x$counts))
#if the original data were integers then this may give a better estimate
#simdata <- lapply(mylist, function(x) rep(x$breaks[-1], x$counts))

#find the mean and sd of the simulated data
means <- lapply(simdata, mean)
sds <- lapply(simdata, sd)
#or use sapply in the above 2 lines depending on future processing needs

If your data were integers, then using the breaks as the bins will provide a better estimate. Depending on how the histogram was built (i.e. right = TRUE vs. right = FALSE), the results may shift by one bin.
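To illustrate the right = TRUE/FALSE point, here is a tiny made-up example; the same integer data lands in different bins depending on which side of each interval is closed:

x <- c(1, 2, 2, 3)
# right = TRUE (the default): bins are (0,1], (1,2], (2,3], (3,4]
hist(x, breaks = 0:4, right = TRUE, plot = FALSE)$counts   # 1 2 1 0
# right = FALSE: bins are [0,1), [1,2), [2,3), [3,4]
hist(x, breaks = 0:4, right = FALSE, plot = FALSE)$counts  # 0 1 2 1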

Edit

I thought this was going to be an easy one. I reviewed the video; the sample data shown was:

mids <- seq(-7, 7)
counts <- c(7, 1, 2, 2, 2, 5, 217, 70, 18, 0, 2, 1, 2, 0, 1)
simdata <- rep(mids, counts)

The video results were mean = -0.7359 and sd = 0.4571. The solution I found that came closest was the "fitdistrplus" package:

fitdist(simdata, "norm", method = "mge")

Using the "maximizing goodness-of-fit estimation" resulted in mean = -0.7597280 and sd= 0.8320465.
At this point, the method above provides a close estimate but does not exactly match. I don't not know what technique was used to calculate the fit from the video.
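For completeness, the other estimation methods fitdist supports can be tried the same way (a sketch, assuming the fitdistrplus package is installed); which of them, if any, matches the video is unclear:

library(fitdistrplus)
simdata <- rep(mids, counts)               # reconstructed sample from the video data above
fitdist(simdata, "norm", method = "mle")   # maximum likelihood (the default)
fitdist(simdata, "norm", method = "mme")   # moment matching
fitdist(simdata, "norm", method = "mge")   # maximizing goodness-of-fit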

Edit #2

The above solutions involved recreating the original data and fitting it using either the mean/sd or the fitdistrplus package. This attempt performs a least-squares fit of a Gaussian distribution to the binned data.

simdata <- lapply(mylist, function(x) rep(x$mids, x$counts))
means <- sapply(simdata, mean)
sds <- sapply(simdata, sd)

#Data from video
#mids <- seq(-7, 7)
#counts <- c(7, 1, 2, 2, 2, 5, 217, 70, 18, 0, 2, 1, 2, 0, 1)

#make a list of the bins and the distribution in each bin
mids <- lapply(mylist, function(x) x$mids)
dis <- lapply(mylist, function(x) x$counts/sum(x$counts))

#function to compute the least-squares objective
nnorm <- function(values, mids, dis) {
  means <- values[1]
  sds <- values[2]
  #print(paste(means, sds))
  #calculate the Gaussian distribution for each bin
  modeld <- dnorm(mids, means, sds)
  #sum of the squared differences
  diff <- sum((modeld - dis)^2)
  diff
}

#use the optim function with the mean and sd as initial guesses
#find the minimum with the mean and SD as the fit parameters
lapply(1:3, function(i) {optim(c(means[[i]], sds[[i]]), nnorm, mids = mids[[i]], dis = dis[[i]])})
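A small follow-up sketch, using the objects defined above, to collect the fitted parameters from the optim results into a single table:

# each optim result stores the fitted (mean, sd) pair in $par
fits <- lapply(names(mylist), function(nm) {
  optim(c(means[[nm]], sds[[nm]]), nnorm, mids = mids[[nm]], dis = dis[[nm]])
})
params <- t(sapply(fits, function(f) setNames(f$par, c("mean", "sd"))))
rownames(params) <- names(mylist)
params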

This solution provides an answer closer to the PRISM results, but still not the same. Here is a comparison of all four solutions. [Table comparing the mean and SD from the four methods; image not reproduced.]

From the table, the least-squares fit (the one just above) provides the closest approximation. Maybe tweaking the midpoints passed to dnorm would help. Note that case B is the farthest from normally distributed, yet PRISM still generates a small standard deviation, while the other methods roughly agree with one another. It is possible that PRISM performs some type of data filtering to remove outliers before the fit.
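One more avenue worth noting, offered as an assumption rather than something verified against PRISM: PRISM's Gaussian model is commonly written as Y = Amplitude*exp(-0.5*((X - Mean)/SD)^2), and that three-parameter form can be fit directly to the counts with nls. A sketch for case A, where the start values are rough guesses and may need tuning:

# build a data frame of midpoints and counts for case A
dfA <- data.frame(mids = mylist$A$mids, counts = mylist$A$counts)
# three-parameter Gaussian: amplitude A, mean m, sd s
fitA <- nls(counts ~ A * exp(-0.5 * ((mids - m)/s)^2),
            data = dfA,
            start = list(A = max(dfA$counts),
                         m = dfA$mids[which.max(dfA$counts)],
                         s = 1))
coef(fitA)  # amplitude, mean, SD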

  • Are you sure that by doing this one applies a nonlinear regression Gaussian curve fit? – nik Jun 28 '16 at 16:17
  • Hi, I checked the above data; I used PRISM to build the nonlinear regression Gaussian curve fit and obtained the average and standard deviation. Can you please see if it is the same? – nik Jun 29 '16 at 09:29
  • The values don't match. I don't know how the PRISM software is performing the fit. It might be clipping or smoothing the tails for the fit. Your case B is not very normal, yet PRISM is generating a standard deviation of <0.5. – Dave2e Jun 29 '16 at 13:08
  • I am doing the same as the video, but it seems like you don't perform any regression to find the best fit; for example, you merge them and make simulated values with simdata <- rep(mids, counts). I gave the X and Y example above; is it possible to do the same on those? – nik Jun 29 '16 at 13:14
  • Third and final attempt – Dave2e Jun 30 '16 at 00:25
  • I do accept your answer and like it. You spent so much time on it and I really appreciate your help. In general, I am wondering why we should build a frequency distribution and then fit by least squares; do you know the reason for this? Also, did you see that they added 1.96*sd to the mean? Do you know a reason for this? – nik Jun 30 '16 at 06:49
  • I'm not completely sure of the reason to calculate the normal distribution and then fit the function. It could be related to the ease of filtering the outliers. Mean + 1.96*sd is a statistical cutoff: if the distribution is truly Gaussian, then 97.5% of the data will fall below this value. – Dave2e Jun 30 '16 at 12:19