I'm trying to reproduce the example given in this answer to fit a distribution to my real data. For instance, df['Note']
is one column of my dataframe and contains the average note for every student:
    Index  StudentName  Note
    0      Mark         3.7
    1      Bryan        3.5
    2      Nil          0.1
    3      Amanda       2.045
    4      Arthur       1.2
    5      Helen        2.5
The real dataframe contains about 350000 rows, with a mean of 2.17, the same as in this sample.
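To make the sample reproducible, here is a minimal sketch that rebuilds the six rows shown above as a dataframe. The variable name `df` is just what I use here for illustration; the real data is read from tab-separated files.

```python
import pandas as pd

# The six sample rows from the table above; the real file has about 350000 rows.
df = pd.DataFrame({
    "StudentName": ["Mark", "Bryan", "Nil", "Amanda", "Arthur", "Helen"],
    "Note": [3.7, 3.5, 0.1, 2.045, 1.2, 2.5],
})
df.index.name = "Index"

print(df)
print("mean of the sample:", round(df["Note"].mean(), 2))
```
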
So this is my code to obtain the fitted distribution:
    import os
    import pandas as pd
    import scipy.stats
    import matplotlib.pyplot as plt

    def fit(dataframe, path):
        dataframe.set_index
        size = len(dataframe.index)
        x = dataframe.index
        y = dataframe['Note']
        # histogram of the notes
        plt.hist(y, bins=range(20))
        dist_names = ['alpha', 'beta', 'norm', 'expon']
        for dist_name in dist_names:
            dist = getattr(scipy.stats, dist_name)
            # fit returns (shape parameters..., loc, scale)
            param = dist.fit(y)
            # pdf evaluated at x (the dataframe index), scaled by the number of rows
            pdf_fitted = dist.pdf(x, *param[:-2], loc=param[-2], scale=param[-1]) * size
            plt.plot(pdf_fitted, label=dist_name)
            plt.xlim(0, 19)
        plt.legend(loc='upper left')
        plt.show()
        plt.savefig(path + '_fit.png', bbox_inches='tight', dpi=100)

    def call_fit(pathname):
        path_picture = "%spicture//" % pathname
        path = "%sdata//" % pathname
        path_s = sorted(os.listdir(path))
        # one tab-separated file per iteration, one picture per file
        for i in path_s:
            file_path = os.path.join(path, i)
            picture = os.path.join(path_picture, i)
            df = pd.read_csv(file_path, sep='\t')
            fit(df, picture)
But the output is very strange. I've included both functions because I'm no longer sure which one causes the bug. Not only do the fitted distributions look strange, but the histogram of my real data isn't plotted at all, and with every iteration something strange happens to the legend. Compare the first plot and the 10th one:
Your help and advice would be much appreciated, because I have no idea where I went wrong.
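For comparison, here is a minimal, self-contained sketch of what I expected the fitting step to do, run on synthetic data rather than my real file. Several things in it are my own assumptions and not something the linked answer confirms: that the pdf should be evaluated on a grid of note values (`grid`) rather than on the row index, that the density has to be multiplied by the number of rows times the bin width to be comparable with the histogram counts, and that a fresh figure should be created and closed on every call. The names `notes`, `grid`, `bins` and `fitted` are hypothetical, and I only use 'norm' and 'expon' to keep it quick.

```python
import numpy as np
import scipy.stats
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Synthetic stand-in for df['Note'] (hypothetical data, not my real file)
rng = np.random.default_rng(0)
notes = rng.normal(loc=2.17, scale=0.8, size=5000).clip(0.01, 5.0)

bins = np.linspace(0, 5, 26)        # 25 bins over the range of the notes
bin_width = bins[1] - bins[0]
grid = np.linspace(0, 5, 200)       # note values, not row indices

fig, ax = plt.subplots()            # a fresh figure for this call
ax.hist(notes, bins=bins, alpha=0.5)

fitted = {}
for dist_name in ["norm", "expon"]:
    dist = getattr(scipy.stats, dist_name)
    param = dist.fit(notes)         # (shape parameters..., loc, scale)
    pdf = dist.pdf(grid, *param[:-2], loc=param[-2], scale=param[-1])
    # scale the density to histogram counts: number of rows * bin width
    ax.plot(grid, pdf * len(notes) * bin_width, label=dist_name)
    fitted[dist_name] = param

ax.legend(loc="upper left")
# fig.savefig(...) would go here, before the figure is closed
plt.close(fig)
```

If this is roughly right, then the differences from my code above (the x values passed to `pdf`, the scaling, and the figure handling) would be the places I should look at first.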