I have a dataframe with x-values (column x) and y-values (column 1), shown below. I am trying to get the mean and standard deviation by fitting a normal curve to them.

Next I plot the data and the fitted curve together on one chart, but the result looks very wrong. It is not just that the fitted curve is shifted; I am not sure what is wrong with it.

import matplotlib.pyplot as plt
from scipy import stats
from scipy import optimize
import numpy as np

data_sample = {'x': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
               '1': [0, 1, 2, 3, 4, 5, 4, 3, 2, 1, 0]}

def test_func(x, a, b):
    return stats.norm.pdf(x, a, b)

params, cov_params = optimize.curve_fit(test_func, data_sample['x'], data_sample['1'])

print(params)

plt.scatter(data_sample['x'], data_sample['1'], label='Data')
plt.plot(data_sample['x'] , test_func(data_sample['x'], params[0], params[1]), label='Fitted function')

plt.legend(loc='best')

plt.show()

JohanC
csuzzanna
  • Welcome to Stack Overflow. Please provide a [minimal reproducible example](https://stackoverflow.com/help/minimal-reproducible-example). In particular the data you are trying to fit. For Pandas please review [this](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – piterbarg Jan 20 '22 at 20:30
  • Make sure the starting parameters for the curve fit are somewhat reasonable. The defaults are 1 for all parameters, which, as shown, may lead to curve_fit not finding a good minimum. – 9769953 Jan 20 '22 at 20:56
  • From your figure, I'd say a combination of two normal distributions is probably a better fit, but that's another problem. – 9769953 Jan 20 '22 at 20:57
  • Okay I provided the minimal reproducible example! – csuzzanna Jan 20 '22 at 20:57
  • `params` is used before it exists. And `a, b = optimize.curve_fit...` does not result in what you (likely) think it does. – 9769953 Jan 20 '22 at 21:00
  • With the current code, corrected for the unassigned `params` and incorrectly assigned `a, b`, I get a correct fit, and can't reproduce your problem. – 9769953 Jan 20 '22 at 21:01

1 Answer

The data needs to be normalized such that the area under the curve is 1, because that is the area under any probability density function. When all x-values are 1 apart, the area is simply the sum of the y-values. If the spacing between the x-values is larger or smaller than 1, the sum needs to be multiplied by that spacing. Another way to calculate the area is np.trapz() (named np.trapezoid() from NumPy 2.0 on).
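As a minimal sketch of the two ways to get the area, using the sample data from the question: the variable names and the getattr fallback between np.trapezoid and np.trapz are my own additions for illustration, not part of the original code.

```python
import numpy as np

x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array([0, 1, 2, 3, 4, 5, 4, 3, 2, 1, 0])

# Assumption for portability: use np.trapezoid (NumPy 2.0+) when it exists,
# otherwise fall back to the older np.trapz.
trapezoid = getattr(np, "trapezoid", None) or np.trapz

# Way 1: sum of the y-values times the (constant) spacing between x-values.
spacing = (x.max() - x.min()) / (len(x) - 1)
area_sum = y.sum() * spacing

# Way 2: trapezoidal rule, which also works for unevenly spaced x-values.
area_trapz = trapezoid(y, x)

print(area_sum, area_trapz)
```

Both give the same area here because the first and last y-values are 0. In general the trapezoidal rule counts the two end points only half, so the results can differ slightly.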

Divide the y-values by this normalization factor before doing the fit, and multiply the fitted curve by the same factor when drawing it together with the original data.

When you try to fit the Gaussian pdf function to non-normalized points, the "best" fit is a very narrow, very high peak. Since the area under a pdf is always 1, the curve can only approach the y=5 value in the center by becoming extremely narrow.
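To put a number on "very narrow": the peak of a normal pdf is 1 / (sigma * sqrt(2 * pi)), so the height fixes the width. The sketch below only works out that relation for a peak of 5. It does not reproduce what curve_fit actually returns, and the variable names are made up for illustration.

```python
import numpy as np
from scipy import stats

target_height = 5  # largest y-value in the sample data

# Peak of a normal pdf is 1 / (sigma * sqrt(2 * pi)); solve for sigma.
sigma_needed = 1 / (target_height * np.sqrt(2 * np.pi))

# Height of that pdf at its center (x=5) and at the neighbouring point (x=4).
peak = stats.norm.pdf(5, loc=5, scale=sigma_needed)
neighbour = stats.norm.pdf(4, loc=5, scale=sigma_needed)

print(sigma_needed, peak, neighbour)
```

A standard deviation of about 0.08 is needed to reach a height of 5, and such a curve is practically zero at every other data point.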

The example code below converts the lists to numpy arrays, so functions can be written more easily. Also, to draw a smooth curve, more detailed x-values are used.

import matplotlib.pyplot as plt
from scipy import stats
from scipy import optimize
import numpy as np

def test_func(x, a, b):
    return stats.norm.pdf(x, a, b)

data_sample = {'x': np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]),
               '1': np.array([0, 1, 2, 3, 4, 5, 4, 3, 2, 1, 0])}

# x_dist = (data_sample['x'].max() - data_sample['x'].min()) / (len(data_sample['x']) - 1)
# normalization_factor = sum(data_sample['1']) * x_dist
normalization_factor = np.trapz(data_sample['1'], data_sample['x'])  # area under the curve
params, pcov = optimize.curve_fit(test_func, data_sample['x'], data_sample['1'] / normalization_factor)

plt.scatter(data_sample['x'], data_sample['1'], clip_on=False, label='Data')
x_detailed = np.linspace(data_sample['x'].min() - 3, data_sample['x'].max() + 3, 200)
plt.plot(x_detailed, test_func(x_detailed, params[0], params[1]) * normalization_factor,
         color='crimson', label='Fitted function')

plt.legend(loc='best')
plt.margins(x=0)
plt.ylim(ymin=0)
plt.tight_layout()
plt.show()

[Figure: fitting a normal curve to some points]
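To read off the mean and standard deviation that the question asks for, the fitted parameters can be printed directly. The following is a sketch without plotting. Passing a starting guess p0 computed from the y-weighted mean and spread is my own assumption, following the advice in the comments about reasonable starting parameters. The names normal_pdf, mean0 and std0 are hypothetical.

```python
import numpy as np
from scipy import optimize, stats

x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array([0, 1, 2, 3, 4, 5, 4, 3, 2, 1, 0])

def normal_pdf(x, mean, std):
    return stats.norm.pdf(x, mean, std)

# Area under the data; the x-values are 1 apart, so the sum is enough.
area = y.sum()

# Assumed starting guess: mean and standard deviation weighted by y.
mean0 = np.average(x, weights=y)
std0 = np.sqrt(np.average((x - mean0) ** 2, weights=y))

# Fit the pdf to the normalized y-values.
params, pcov = optimize.curve_fit(normal_pdf, x, y / area, p0=[mean0, std0])
mean_fit, std_fit = params

# One-sigma uncertainties of the fitted parameters.
mean_err, std_err = np.sqrt(np.diag(pcov))

print(f"mean = {mean_fit:.3f} +/- {mean_err:.3f}")
print(f"std  = {std_fit:.3f} +/- {std_err:.3f}")
```

The normalization only rescales the y-values, so the fitted mean and standard deviation apply to the original data as they are. Only the drawn curve needs to be multiplied by the area again.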

PS: Using the original code (without the normalization), but with more detailed x values, the narrow curve would be more apparent:

x_detailed = np.linspace(min(data_sample['x']) - 1, max(data_sample['x']) + 1, 500)
plt.plot(x_detailed, test_func(x_detailed, params[0], params[1]), color='m', label='Fitted function')

[Figure: narrow Gaussian curve for non-normalized data]

JohanC