
I need to do some robust data-fitting operation.

I have a bunch of (x,y) data that I want to fit to a Gaussian (aka normal) function. The point is, I want to remove the outliers. As one can see in the sample plot below, there is another distribution of data that's polluting my data on the right, and I don't want to take it into account when doing the fitting (i.e. when finding \sigma, \mu and the overall scale parameter).

[sample data plot]

R seems to be the right tool for the job, I found some packages (robust, robustbase, MASS for example) that are related to robust fitting.

However, they assume the user already has a strong knowledge of R, which is not my case, and their documentation is only provided as a sort of reference manual, with no tutorial or equivalent. My statistical background is rather limited; I attempted to read reference material on fitting with R, but it didn't really help (and I'm not even sure that's the right way to go). Yet I have the feeling that this is actually a quite simple operation.

I have checked this related question (and the linked ones); however, they take as input a single vector of values, whereas I have a vector of pairs, so I don't see how to carry the approach over.

Any help on how to do this would be appreciated.

kebs
  • I think the related question is about fitting a distribution as a density of one-dimensional data. What you've got is data {x,f(x)} and you want to fit the parameters of f(x), rather than estimate the parameters of a distribution. – Spacedman Apr 08 '13 at 15:04
  • Do you want to remove outliers or just fit a gaussian? – Nishanth Apr 08 '13 at 15:10
  • I'm also slightly concerned that your data points don't look like they have independent errors - seems to be four or five separate series. You should consider this in your methods... – Spacedman Apr 08 '13 at 15:13
  • @e4e5f4: I want to get the parameters of the underlying Gaussian without taking the outliers into account, so to me the approach is not important: either remove them and then compute the parameters (mean and stddev directly, in that case), or use some robust fitting algorithm. – kebs Apr 08 '13 at 15:17
  • @Spacedman, first comment: yes, well... as you like ;-) But now, how do I do this? The data is loaded into R memory with read.table (2 columns)... and I'm stuck! – kebs Apr 08 '13 at 15:21
  • Can you post your data somewhere? – Spacedman Apr 08 '13 at 15:23
  • @Spacedman (second): Sorry, I don't see what you mean. The Gaussian is indeed quite noisy, but it comes from a dataset of limited size; I have others that are much more "gaussian-like". Except they all have those nasty outliers on the right, and I *know* where they come from (experimental artifacts). – kebs Apr 08 '13 at 15:23
  • @Spacedman Sure! http://pastebin.com/jjgzcaT8 . It was loaded in R with d1<-read.table("datafile.dat", colClasses=c("NULL","NULL", NA, "NULL", "NULL", NA )) (only third and sixth columns useful here) – kebs Apr 08 '13 at 15:26
  • If you have a model for the "outliers" then you could incorporate them into the fitted function. It could be two Gaussians, for example, or the end of an exponential tail for the noise plus your Gaussian signal. – Spacedman Apr 08 '13 at 15:27

2 Answers


To fit a Gaussian curve to the data, the principle is to minimise the sum of squared differences between the fitted curve and the data. So we define an objective function f and run optim on it:

fitG =
function(x,y,mu,sig,scale){

  # sum of squared residuals between the scaled Gaussian and the data
  f = function(p){
    d = p[3]*dnorm(x,mean=p[1],sd=p[2])
    sum((d-y)^2)
  }

  # minimise over (mean, sd, scale), starting from the supplied guesses
  optim(c(mu,sig,scale),f)
}
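As a quick sanity check (on synthetic data, since the real data lives in the asker's file), the fit should recover known parameters; the numbers below are just an illustrative example:

```r
# simulate a scaled Gaussian curve with known parameters plus a little noise
set.seed(42)
x <- seq(0, 12, by = 0.05)
y <- 0.02 * dnorm(x, mean = 6, sd = 0.6) + rnorm(length(x), sd = 1e-4)

# start from deliberately rough guesses
fit <- fitG(x, y, mu = 5, sig = 1, scale = 0.05)
fit$par  # should come out close to c(6, 0.6, 0.02)
```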

Now, extend this to two Gaussians:

fit2G <- function(x,y,mu1,sig1,scale1,mu2,sig2,scale2,...){

  # sum of squared residuals for the sum of two scaled Gaussians
  f = function(p){
    d = p[3]*dnorm(x,mean=p[1],sd=p[2]) + p[6]*dnorm(x,mean=p[4],sd=p[5])
    sum((d-y)^2)
  }
  optim(c(mu1,sig1,scale1,mu2,sig2,scale2),f,...)
}

Fit with initial parameters taken from the first fit, and an eyeballed guess for the second peak. We need to increase the maximum number of iterations:

> fit2P = fit2G(data$V3,data$V6,6,.6,.02,8.3,0.10,.002,control=list(maxit=10000))
Warning messages:
1: In dnorm(x, mean = p[1], sd = p[2]) : NaNs produced
2: In dnorm(x, mean = p[4], sd = p[5]) : NaNs produced
3: In dnorm(x, mean = p[4], sd = p[5]) : NaNs produced
> fit2P
$par
[1] 6.035610393 0.653149616 0.023744876 8.317215066 0.107767881 0.002055287

What does this all look like?

> plot(data$V3,data$V6)
> p = fit2P$par
> lines(data$V3,p[3]*dnorm(data$V3,p[1],p[2]))
> lines(data$V3,p[6]*dnorm(data$V3,p[4],p[5]),col=2)

[plot: data points with the two fitted Gaussian curves overlaid]

However, I would be wary of drawing statistical inferences about your function parameters from this fit...

The warning messages are probably due to the sd parameter going negative during the search. You can fix this, and also get quicker convergence, by using L-BFGS-B and setting a lower bound:

> fit2P = fit2G(data$V3,data$V6,6,.6,.02,8.3,0.10,.002,control=list(maxit=10000),method="L-BFGS-B",lower=c(0,0,0,0,0,0))
> fit2P
$par
[1] 6.03564202 0.65302676 0.02374196 8.31424025 0.11117534 0.00208724

As pointed out in the comments, sensitivity to initial values is always a problem with curve fitting like this.

Spacedman
  • Great! This gives exactly what I was looking for (even more, as it also gives the "noise" parameters). I don't understand all the R steps exactly, but I will study them in detail. Thank you very much for such a clear and precise answer! I doubt I would have achieved this for weeks on my own, so thanks SO too. – kebs Apr 08 '13 at 22:25
  • Just one more point (for the future reader): this method is quite sensitive to the initial values given, so a preliminary approximate value must be found first. – kebs Apr 09 '13 at 09:27

Fitting a Gaussian:

# your data
set.seed(0)
data <- c(rnorm(100,0,1), 10, 11) 

# find & remove outliers
outliers <- boxplot(data)$out
data <- setdiff(data, outliers)

# fitting a Gaussian
mu <- mean(data)
sigma <- sd(data)

# testing the fit, check the p-value
reference.data <- rnorm(length(data), mu, sigma)
ks.test(reference.data, data) 
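A brief note on reading that last line: ks.test returns an htest object, and a large p-value means the test finds no evidence that the cleaned data and the fitted Gaussian differ. A sketch, continuing from the variables above:

```r
# a large p-value suggests the Gaussian fit is plausible for the cleaned data
result <- ks.test(reference.data, data)
result$p.value  # e.g. compare against a 0.05 threshold
```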
Nishanth