
I have some data (X-ray diffraction) that looks like this:

[image: the measured X-ray diffraction data]

I want to fit a Gaussian to this data set to get the FWHM of the 'wider' portion. The double peak at around 7 degrees theta is not important information; it comes from unwanted sources.

To make myself clearer, I want something like this (which I made in Paint :) ):

[image: hand-drawn sketch of the desired broad Gaussian fit]

I have tried to script something in python with the following code:

import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

data2 = np.loadtxt('FWHM.spc')
x2, y2 = data2[:, 0], data2[:, 7]
plt.title('Full Width Half Max of 002 Peak')
plt.plot(x2, y2, color='b')
plt.xlabel('$\\theta$', fontsize=10)
plt.ylabel('Intensity', fontsize=10)
plt.xlim([3, 11])
plt.xticks(np.arange(3, 12, 1), fontsize=10)
plt.yticks(fontsize=10)

def func(x, a, x0, sigma):
    return a*np.exp(-(x-x0)**2/(2*sigma**2))

# moment-based starting guesses for the peak centre and width
mean = sum(x2*y2)/sum(y2)
sigma2 = np.sqrt(abs(sum((x2-mean)**2*y2)/sum(y2)))
popt, pcov = curve_fit(func, x2, y2, p0=[1, mean, sigma2])
ym = func(x2, *popt)
plt.plot(x2, ym, c='r', label='Best fit')
FWHM = round(2*np.sqrt(2*np.log(2))*popt[2], 4)
plt.axvspan(popt[1]-FWHM/2, popt[1]+FWHM/2, facecolor='g', alpha=0.3, label='FWHM = %s' % FWHM)
plt.legend(fontsize=10)
plt.show()

and I get the following output:

[image: the resulting plot of the fit]

Clearly, this is way off from what is desired. Does anyone have tips on how I can achieve this?

(I have enclosed the data here: https://justpaste.it/109qp)

sci-guy
  • Suggestion: Change `p0` to `[650, mean, sigma2]` and plot the initial guess on the last plot. It would help me visualize the problem better: `plt.plot(x2, func(x2, 650, mean, sigma2), c='r', linestyle=':', label='Initial Guess')` – Mad Physicist Nov 10 '16 at 19:12
  • Well, clearly you're fitting the wrong function. Why don't you try to fit a more complex model? It looks like your spectrum can be modelled as a sum of two narrow lines, a broad one, and some baseline level that can be either removed or fit separately on the edge data points. – Vlas Sokolov Nov 10 '16 at 19:14
  • Also, is there a consistent way to identify the discontinuity? E.g., can we say, for example, that data points whose y-value differs by more than 100 from the y-value of the previous point are trash for the purposes of this fit? – Mad Physicist Nov 10 '16 at 19:14
  • @VlasSokolov. OP seems to be asking for a way to filter out the information from unwanted sources, in which case a Gaussian fit will work just fine (possibly with the addition of a simple baseline). – Mad Physicist Nov 10 '16 at 19:17
  • @MadPhysicist sure, the narrow "contamination" peak could alternatively be masked out similarly to what you say. But IMO fitting a more complex model and taking the fit for the narrow component might be easier than flagging the data by arbitrary heuristics. – Vlas Sokolov Nov 10 '16 at 19:26
  • @VlasSokolov. That depends. OP says `The double peak at around 7 degrees theta is not important information and coming from unwanted sources.` From a scientific standpoint, we'd want to get rid of this data before doing the analysis. – Mad Physicist Nov 10 '16 at 19:42
  • I am assuming OP does not want to take the time to model this peak at all based on the sentence I quoted above. – Mad Physicist Nov 10 '16 at 19:42
  • @MadPhysicist I agree with the general argument, as long as discarding parts of the data is well justified. My preference to model the unwanted signal comes from reluctance to draw a line between "solid" and "unwanted" data. Luckily for me, the unwanted sources fit the double gaussian model perfectly (answer incoming), and the `scipy` implementation for a composite fit is very straightforward. – Vlas Sokolov Nov 10 '16 at 20:09
  • @VlasSokolov. I am looking forward to seeing how you do the composite fit. – Mad Physicist Nov 10 '16 at 20:13

1 Answer


As mentioned in the comments, one way to constrain a signal in the presence of unwanted data is to model it together with the desired signal. Of course, this approach only works when a suitable model is readily available for the contaminating data. For the data you provide, one can consider a composite model that sums the following components:

  1. A linear baseline, because all your points are consistently offset from zero.
  2. Two narrow Gaussian components that will model the double-peaked feature at the central part of your spectrum.
  3. A broad Gaussian component. This is the one you're actually trying to constrain.

All four components (the double peak counts twice) can be fit simultaneously once you pass a reasonable starting guess to `curve_fit`:

def composite_spectrum(x, # data
                       a, b, # linear baseline
                       a1, x01, sigma1, # 1st line
                       a2, x02, sigma2, # 2nd line
                       a3, x03, sigma3): # 3rd line
    # func is the single-Gaussian model defined in the question
    return (x*a + b + func(x, a1, x01, sigma1)
                    + func(x, a2, x02, sigma2)
                    + func(x, a3, x03, sigma3))

guess = [1, 200, 1000, 7, 0.05, 1000, 6.85, 0.05, 400, 7, 0.6]

popt, pcov = curve_fit(composite_spectrum, x2, y2, p0 = guess)
plt.plot(x2, composite_spectrum(x2, *popt), 'k', label='Total fit')
plt.plot(x2, func(x2, *popt[-3:])+x2*popt[0]+popt[1], c='r', label='Broad component')
FWHM = round(2*np.sqrt(2*np.log(2))*popt[10],4)
plt.axvspan(popt[9]-FWHM/2, popt[9]+FWHM/2, facecolor='g', alpha=0.3, label='FWHM = %s'%(FWHM))
plt.legend(fontsize=10)
plt.show()
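Since `curve_fit` also returns the covariance matrix, the uncertainty on the fitted width can be propagated to the FWHM via `np.sqrt(np.diag(pcov))`. Here is a minimal, self-contained sketch on synthetic data (the single Gaussian, noise level, and starting guess below are made up for illustration; in the composite fit above, the broad width is `popt[10]` and its variance is `pcov[10, 10]`):

```python
import numpy as np
from scipy.optimize import curve_fit

def gauss(x, a, x0, sigma):
    return a * np.exp(-(x - x0)**2 / (2 * sigma**2))

# synthetic stand-in for the real spectrum: one broad line plus noise
rng = np.random.default_rng(0)
x = np.linspace(3, 11, 200)
y = gauss(x, 400, 7.0, 0.6) + rng.normal(0, 5, x.size)

popt, pcov = curve_fit(gauss, x, y, p0=[300, 7, 0.5])
perr = np.sqrt(np.diag(pcov))      # 1-sigma parameter uncertainties
k = 2 * np.sqrt(2 * np.log(2))     # FWHM = k * sigma for a Gaussian
FWHM = k * abs(popt[2])
FWHM_err = k * perr[2]
```

Quoting an uncertainty alongside the FWHM also makes it easier to judge whether, e.g., a Gaussian versus a Lorentzian profile changes the result significantly.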

[image: data with the composite fit and broad component overplotted]

In cases where the unwanted sources cannot be modeled properly, the unwanted discontinuity can be masked out, as suggested by Mad Physicist. In the simplest case, you could even mask out everything inside the [6.5, 7.4] interval.
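A minimal sketch of that masking approach on synthetic data (the stand-in spectrum and the [6.5, 7.4] interval are assumptions for illustration, not the real `FWHM.spc` data):

```python
import numpy as np
from scipy.optimize import curve_fit

def gauss(x, a, x0, sigma):
    return a * np.exp(-(x - x0)**2 / (2 * sigma**2))

# synthetic stand-in: broad line of interest plus a narrow contaminating spike
x2 = np.linspace(3, 11, 400)
y2 = gauss(x2, 400, 7.0, 0.6) + gauss(x2, 1000, 7.0, 0.05)

# drop every point inside the contaminated interval before fitting
mask = (x2 < 6.5) | (x2 > 7.4)
popt, pcov = curve_fit(gauss, x2[mask], y2[mask], p0=[300, 7, 0.5])
FWHM = 2 * np.sqrt(2 * np.log(2)) * abs(popt[2])
```

Since the spike lies entirely inside the masked interval, the fit on the remaining points recovers the broad component alone.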

Vlas Sokolov
  • Exactly what I was looking for! Thank you very much. – sci-guy Nov 11 '16 at 12:44
  • One should add that gaussians are usually not ideal to describe diffraction profiles. In principle they have lorentzian shape which experiences a gaussian broadening due to the experiment setup. Typically pseudo-Voigt profiles are used for the sake of calculation speed. This kind of function can also produce similar slopes as seen in the data. – dnalow Nov 11 '16 at 22:51
  • @dnalow Are you saying I should fit a lorentzian to my curve? This will not really affect the FWHM -- or? – sci-guy Nov 12 '16 at 12:57
  • I think it is the norm to use pseudo voigt functions for diffraction (a mix of gaussian and lorentzian). The FWHM can in principle be affected. But they are used to get a better description of the diffraction patterns, since the pure gaussian often decays too fast to describe the slopes. It is just a remark. Anyway, if you are ONLY interested in the FWHM, there are more direct, simpler and faster ways to get it than fitting the profile. – dnalow Nov 12 '16 at 15:11
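As dnalow notes, if only the FWHM is needed, it can be estimated directly from the data without fitting any profile, by interpolating where the curve crosses half its maximum. A hedged sketch (the helper name `fwhm_direct` is my own; it assumes a single dominant peak with the baseline already subtracted and the contaminating spike masked out):

```python
import numpy as np

def fwhm_direct(x, y):
    """Estimate the FWHM of a single peak by locating the half-maximum
    crossings with linear interpolation. Assumes one dominant peak and
    a baseline near zero."""
    half = y.max() / 2.0
    idx = np.where(y >= half)[0]
    i0, i1 = idx[0], idx[-1]
    # interpolate the exact crossing position on each flank
    xl = np.interp(half, [y[i0 - 1], y[i0]], [x[i0 - 1], x[i0]])
    xr = np.interp(half, [y[i1 + 1], y[i1]], [x[i1 + 1], x[i1]])
    return xr - xl

# sanity check on an exact Gaussian: FWHM should be ~2.3548 * sigma
x = np.linspace(3, 11, 2000)
y = 400 * np.exp(-(x - 7.0)**2 / (2 * 0.6**2))
est = fwhm_direct(x, y)
```

Unlike a profile fit, this makes no assumption about the line shape (Gaussian, Lorentzian, or pseudo-Voigt), but it is more sensitive to noise and baseline errors near the half-maximum level.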