
I wanted to create a function that takes a DataFrame and a column name and returns a histogram with a normal curve and some labeling, something I can reuse and customize for future data (I would appreciate any recommendations for making it more customizable). It was written for the Kaggle Titanic training set; if needed, please download it from Kaggle.

The function works fine for columns without NaN values. The Age column contains NaN, which is what I think is throwing the error. I tried to ignore the NaN values by following "Error when plotting DataFrame containing NaN with Pandas 0.12.0 and Matplotlib 1.3.1 on Python 3.3.2", where one of the solutions recommends using subplot, but it doesn't work for me; the accepted solution is to downgrade matplotlib (my version is '2.1.2', Python is 3.6.4). "pylab histogram get rid of nan" uses an interesting method that I am not able to apply in my case.

How do I remove the NaN values? Is this function customizable? Not the primary question: can I neatly do things like round the mean/std or add more information?

import numpy as np
import pandas as pd
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
mydf = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))

def df_col_hist (df,col, n_bins):

    fig, ax = plt.subplots()
    n, bins, patches = ax.hist(df[col], n_bins, normed=1)

    y = mlab.normpdf(bins, df[col].mean(), df[col].std())
    ax.plot(bins, y, '--')

    ax.set_xlabel (df[col].name)
    ax.set_ylabel('Probability density')
    ax.set_title(f'Histogram of {df[col].name}: $\mu={df[col].mean()}$, $\sigma={df[col].std()}$')

    fig.tight_layout()
    plt.show()

df_col_hist (train_data, 'Fare', 100)
#Works Fine, Tidy little histogram. 

df_col_hist (train_data, 'Age', 100)
#ValueError: max must be larger than min in range parameter.

    ..\Anaconda3\lib\site-packages\numpy\core\_methods.py:29: RuntimeWarning: invalid value encountered in reduce
  return umr_minimum(a, axis, None, out, keepdims)
..\Anaconda3\lib\site-packages\numpy\core\_methods.py:26: RuntimeWarning: invalid value encountered in reduce
  return umr_maximum(a, axis, None, out, keepdims)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-75-c81b76c1f28e> in <module>()
----> 1 df_col_hist (train_data, 'Age', 100)

<ipython-input-70-1cf1645db595> in df_col_hist(df, col, n_bins)
      2 
      3     fig, ax = plt.subplots()
----> 4     n, bins, patches = ax.hist(df[col], n_bins, normed=1)
      5 
      6     y = mlab.normpdf(bins, df[col].mean(), df[col].std())

~\Anaconda3\lib\site-packages\matplotlib\__init__.py in inner(ax, *args, **kwargs)
   1715                     warnings.warn(msg % (label_namer, func.__name__),
   1716                                   RuntimeWarning, stacklevel=2)
-> 1717             return func(ax, *args, **kwargs)
   1718         pre_doc = inner.__doc__
   1719         if pre_doc is None:

~\Anaconda3\lib\site-packages\matplotlib\axes\_axes.py in hist(***failed resolving arguments***)
   6163             # this will automatically overwrite bins,
   6164             # so that each histogram uses the same bins
-> 6165             m, bins = np.histogram(x[i], bins, weights=w[i], **hist_kwargs)
   6166             m = m.astype(float)  # causes problems later if it's an int
   6167             if mlast is None:

~\Anaconda3\lib\site-packages\numpy\lib\function_base.py in histogram(a, bins, range, normed, weights, density)
    665     if first_edge > last_edge:
    666         raise ValueError(
--> 667             'max must be larger than min in range parameter.')
    668     if not np.all(np.isfinite([first_edge, last_edge])):
    669         raise ValueError(
pyeR_biz

1 Answer


Your call to normpdf is wrong, as it expects an array of x-values as its first parameter, not the number of bins. But anyway, mlab.normpdf is deprecated afaik.

That said, I'd recommend using norm.pdf from scipy:

from scipy.stats import norm

s = np.std(df[col])
m = df[col].mean()
x = np.linspace(m - 3*s, m + 3*s, 51)
y = norm.pdf(x, loc=m, scale=s)   # loc is the mean, scale is the standard deviation (defaults to 1 if omitted)

ax.plot(x, y, '--', label='probability density function')
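As a quick sanity check on how `norm.pdf` is parameterised: `loc` is the mean and `scale` is the standard deviation, and `scale` falls back to 1 when it is left out. The sketch below uses made-up values for `m` and `s` (they are not taken from the Titanic data) and only confirms that the curve has the expected peak height and roughly unit area.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical mean and standard deviation, for illustration only.
m, s = 30.0, 14.5

# Same x-grid construction as in the snippet above: mean +/- 3 standard deviations.
x = np.linspace(m - 3 * s, m + 3 * s, 51)
y = norm.pdf(x, loc=m, scale=s)

# A normal density peaks at x = m with height 1 / (s * sqrt(2 * pi)).
peak_expected = 1.0 / (s * np.sqrt(2.0 * np.pi))
peak_computed = y.max()

# Trapezoidal area under the curve; about 0.997 of the mass lies within 3 sigma.
area = np.sum((y[1:] + y[:-1]) / 2.0 * np.diff(x))

print(peak_expected, peak_computed, area)
```

If `scale` were omitted here, the curve would be far narrower and taller than a density-normalised histogram of the same data.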

PS: To drop NaN values from a pandas column, you can use

df[col].dropna()

i.e.:

n, bins, patches = ax.hist(df[col].dropna(), n_bins, normed=1)
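Putting the pieces together, one possible shape for the whole function is sketched below. It is not the asker's exact code: the `decimals` parameter, the default `n_bins=30`, and the synthetic `demo` frame are assumptions added for illustration. It uses `density=True`, which newer matplotlib versions expect in place of `normed=1`, and it returns the figure and axes instead of calling `plt.show()` so the plot can be customised further by the caller.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import norm


def df_col_hist(df, col, n_bins=30, decimals=2):
    """Histogram of df[col] with a fitted normal curve; NaN values are dropped."""
    data = df[col].dropna()
    mean, std = data.mean(), data.std()

    fig, ax = plt.subplots()
    # density=True scales the bars to unit total area, matching the pdf's scale.
    ax.hist(data, bins=n_bins, density=True)

    # Evaluate the normal curve on a grid spanning the observed data range.
    x = np.linspace(data.min(), data.max(), 200)
    ax.plot(x, norm.pdf(x, loc=mean, scale=std), '--')

    ax.set_xlabel(col)
    ax.set_ylabel('Probability density')
    # Raw f-string keeps the LaTeX backslashes; the format spec rounds the numbers.
    ax.set_title(
        rf'Histogram of {col}: $\mu={mean:.{decimals}f}$, $\sigma={std:.{decimals}f}$'
    )
    fig.tight_layout()
    return fig, ax


# Made-up data with missing values, standing in for a column such as Age.
rng = np.random.default_rng(0)
ages = rng.normal(30, 14, size=200)
ages[::10] = np.nan  # every tenth value is missing
demo = pd.DataFrame({'Age': ages})

fig, ax = df_col_hist(demo, 'Age', n_bins=20)
# plt.show()  # uncomment when running interactively
```

Because the axes object is returned, extra information (for example a legend or an annotation with the sample size) can be added after the call without changing the function itself.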
SpghttCd