I want to create a function that takes a DataFrame and a column name and returns a histogram with a normal curve and some labeling, something I can reuse and customize for future data (I would appreciate any recommendations to make it more customizable). It was written for the Kaggle Titanic training set; if needed, please download it from here.

The function works fine for columns without NaN values. The Age column has NaN values, which I think is what throws the error. I tried to ignore the NaN values following "Error when plotting DataFrame containing NaN with Pandas 0.12.0 and Matplotlib 1.3.1 on Python 3.3.2", where one of the solutions recommends using subplot, but that doesn't work for me; the accepted solution is downgrading matplotlib (my version is '2.1.2', Python is 3.6.4). "pylab histogram get rid of nan" uses an interesting method that I am not able to apply in my case.

How can I remove the NaN values? Is this function customizable? A secondary question: can I neatly do things like round the mean/std or add more information?
import numpy as np
import pandas as pd
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
mydf = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
def df_col_hist(df, col, n_bins):
    fig, ax = plt.subplots()
    n, bins, patches = ax.hist(df[col], n_bins, normed=1)
    y = mlab.normpdf(bins, df[col].mean(), df[col].std())
    ax.plot(bins, y, '--')
    ax.set_xlabel(df[col].name)
    ax.set_ylabel('Probability density')
    ax.set_title(rf'Histogram of {df[col].name}: $\mu={df[col].mean()}$, $\sigma={df[col].std()}$')
    fig.tight_layout()
    plt.show()
df_col_hist (train_data, 'Fare', 100)
#Works Fine, Tidy little histogram.
df_col_hist (train_data, 'Age', 100)
#ValueError: max must be larger than min in range parameter.
..\Anaconda3\lib\site-packages\numpy\core\_methods.py:29: RuntimeWarning: invalid value encountered in reduce
return umr_minimum(a, axis, None, out, keepdims)
..\Anaconda3\lib\site-packages\numpy\core\_methods.py:26: RuntimeWarning: invalid value encountered in reduce
return umr_maximum(a, axis, None, out, keepdims)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-75-c81b76c1f28e> in <module>()
----> 1 df_col_hist (train_data, 'Age', 100)
<ipython-input-70-1cf1645db595> in df_col_hist(df, col, n_bins)
2
3 fig, ax = plt.subplots()
----> 4 n, bins, patches = ax.hist(df[col], n_bins, normed=1)
5
6 y = mlab.normpdf(bins, df[col].mean(), df[col].std())
~\Anaconda3\lib\site-packages\matplotlib\__init__.py in inner(ax, *args, **kwargs)
1715 warnings.warn(msg % (label_namer, func.__name__),
1716 RuntimeWarning, stacklevel=2)
-> 1717 return func(ax, *args, **kwargs)
1718 pre_doc = inner.__doc__
1719 if pre_doc is None:
~\Anaconda3\lib\site-packages\matplotlib\axes\_axes.py in hist(***failed resolving arguments***)
6163 # this will automatically overwrite bins,
6164 # so that each histogram uses the same bins
-> 6165 m, bins = np.histogram(x[i], bins, weights=w[i], **hist_kwargs)
6166 m = m.astype(float) # causes problems later if it's an int
6167 if mlast is None:
~\Anaconda3\lib\site-packages\numpy\lib\function_base.py in histogram(a, bins, range, normed, weights, density)
665 if first_edge > last_edge:
666 raise ValueError(
--> 667 'max must be larger than min in range parameter.')
668 if not np.all(np.isfinite([first_edge, last_edge])):
669 raise ValueError(
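
One possible way around the error, assuming the NaN values in the column really are what makes the histogram's min/max undefined, is to drop them with `Series.dropna()` before binning. The sketch below is not a definitive fix. The function name `df_col_hist_dropna`, the `decimals` parameter, and the `demo` DataFrame are made up for illustration. It also swaps `normed=1` for `density=True` and writes the normal pdf out with NumPy instead of `mlab.normpdf`, since both of those were removed in later matplotlib releases.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def df_col_hist_dropna(df, col, n_bins=30, decimals=2):
    """Density histogram of df[col] with a normal curve, ignoring NaN values."""
    values = df[col].dropna()  # remove missing values before binning
    mu, sigma = values.mean(), values.std()

    fig, ax = plt.subplots()
    # density=True is the newer spelling of normed=1
    n, bins, patches = ax.hist(values, bins=n_bins, density=True)

    # Normal pdf evaluated at the bin edges, written out explicitly
    y = np.exp(-0.5 * ((bins - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    ax.plot(bins, y, '--')

    ax.set_xlabel(col)
    ax.set_ylabel('Probability density')
    # Round mean/std in the format spec; n is the count of non-NaN values
    ax.set_title(
        rf'Histogram of {col}: $\mu={mu:.{decimals}f}$, '
        rf'$\sigma={sigma:.{decimals}f}$, n={len(values)}'
    )
    fig.tight_layout()
    return fig, ax


# Hypothetical stand-in for the Titanic 'Age' column, with two missing values
demo = pd.DataFrame({'Age': [22.0, 38.0, np.nan, 35.0, np.nan, 54.0, 2.0, 27.0]})
fig, ax = df_col_hist_dropna(demo, 'Age', n_bins=5)
```

Returning `fig, ax` instead of calling `plt.show()` inside the function leaves room for further customization (extra annotations, saving to a file) before displaying the figure with `plt.show()`.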