
I'm trying to compute the probability under a normal distribution for the data in my DataFrame df in Python. I'm not experienced with Python or programming. The following user-defined function, which I found on this site, works, but the scipy function does not.

UDF:

def normal(x,mu,sigma):
    return ( 2.*np.pi*sigma**2. )**-.5 * np.exp( -.5 * (x-mu)**2. / sigma**2. )
df["normprob"] = normal(df["return"],df["meanreturn"],df["sdreturn"])

This scipy function does not work:

df["normdistprob"] = scip.norm.sf(df["return"],df["meanreturn"],df["sdreturn"])

and it produces the following warnings:

C:\Anaconda3\lib\site-packages\scipy\stats\_distn_infrastructure.py:1815: RuntimeWarning: invalid value encountered in true_divide
  x = np.asarray((x - loc)/scale, dtype=dtyp)
C:\Anaconda3\lib\site-packages\scipy\stats\_distn_infrastructure.py:1816: RuntimeWarning: invalid value encountered in greater
  cond0 = self._argcheck(*args) & (scale > 0)
C:\Anaconda3\lib\site-packages\scipy\stats\_distn_infrastructure.py:879: RuntimeWarning: invalid value encountered in greater
  return (self.a < x) & (x < self.b)
C:\Anaconda3\lib\site-packages\scipy\stats\_distn_infrastructure.py:879: RuntimeWarning: invalid value encountered in less
  return (self.a < x) & (x < self.b)
C:\Anaconda3\lib\site-packages\scipy\stats\_distn_infrastructure.py:1817: RuntimeWarning: invalid value encountered in greater
  cond1 = self._open_support_mask(x) & (scale > 0)
C:\Anaconda3\lib\site-packages\scipy\stats\_distn_infrastructure.py:1818: RuntimeWarning: invalid value encountered in less_equal
  cond2 = cond0 & (x <= self.a)

Any advice is appreciated. Also note that the first 20 cells of

df["meanreturn"]

are NA; I'm not sure if that's affecting it.

Netwave
JoeJack
  • Yeah, having NA in any math calculation will make it crash – Netwave Feb 01 '18 at 08:32
  • What is your intended way of calculating the probability if the mean is NA? – nnnmmm Feb 01 '18 at 08:34
  • Okay, I thought even though it was the first 20 cells, that wouldn't affect the rest of the dataset, and the first 20 cells of 'df["normdist"]' would simply be NaN as well. Also, from this link https://stackoverflow.com/questions/25039328/specifying-skip-na-when-calculating-mean-of-the-column-in-a-data-frame-created, it seems that the NaN cells wouldn't matter? – JoeJack Feb 01 '18 at 08:37

1 Answer

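I'm not sure the survival function is what you need: norm.sf returns 1 - cdf, the probability of observing a value greater than x. I believe what you're looking for is scipy's pdf function, specifically the pdf of a normal random variable, which is what your custom function computes. I tested it against the custom function you used: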

>>> from scipy.stats import norm
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({'x': [0.6, 0.5, 0.13], 'mu': [0, 1, 1], 'std': [1, 2, 1]})
>>> norm.pdf(df['x'], df['mu'], df['std'])
array([ 0.3332246 ,  0.19333406,  0.27324443])
>>> def normal(x,mu,sigma):
...     return ( 2.*np.pi*sigma**2. )**-.5 * np.exp( -.5 * (x-mu)**2. / sigma**2. )
...
>>> normal(df['x'], df['mu'], df['std'])
0    0.333225
1    0.193334
2    0.273244
dtype: float64
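
To make the difference between the two functions concrete, here is a minimal sketch that evaluates the density, the cumulative probability, and the survival function at a single point. The values x = 0.6, mu = 0, sigma = 1 are taken from the first row of the example above; the variable names are only illustrative.

```python
import numpy as np
from scipy.stats import norm

x, mu, sigma = 0.6, 0.0, 1.0  # first row of the example DataFrame

density = norm.pdf(x, loc=mu, scale=sigma)   # height of the bell curve at x
below = norm.cdf(x, loc=mu, scale=sigma)     # P(X <= x)
above = norm.sf(x, loc=mu, scale=sigma)      # P(X > x), i.e. 1 - cdf

print(density, below, above)
print(np.isclose(above, 1.0 - below))        # sf and cdf are complements
```

If what you want is a tail probability (for example, how likely a return at least this large is), sf may be the right call after all; if you want to reproduce the custom function, pdf is the match.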

Note that if your mu and std columns contain np.nan, you will get the runtime warnings, but you will still get an output, just as with the custom function: the affected rows come back as NaN and the other rows are computed normally.

>>> df = pd.DataFrame({'x': [0.6, 0.5, 0.13], 'mu': [np.nan, 1, 1], 'std': [np.nan, 2, np.nan]})
>>> norm.pdf(df['x'], df['mu'], df['std'])
C:\Users\lyang3\AppData\Local\Continuum\anaconda3\lib\site-packages\scipy\stats\_distn_infrastructure.py:1650: RuntimeWarning: invalid value encountered in greater
  cond0 = self._argcheck(*args) & (scale > 0)
C:\Users\lyang3\AppData\Local\Continuum\anaconda3\lib\site-packages\scipy\stats\_distn_infrastructure.py:876: RuntimeWarning: invalid value encountered in greater_equal
  return (self.a <= x) & (x <= self.b)
C:\Users\lyang3\AppData\Local\Continuum\anaconda3\lib\site-packages\scipy\stats\_distn_infrastructure.py:876: RuntimeWarning: invalid value encountered in less_equal
  return (self.a <= x) & (x <= self.b)
C:\Users\lyang3\AppData\Local\Continuum\anaconda3\lib\site-packages\scipy\stats\_distn_infrastructure.py:1651: RuntimeWarning: invalid value encountered in greater
  cond1 = self._support_mask(x) & (scale > 0)
array([        nan,  0.19333406,         nan])
>>> normal(df['x'], df['mu'], df['std'])
0         NaN
1    0.193334
2         NaN
dtype: float64

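Replacing np.nan with None gives the same results, and in the session below the warnings are not shown again. Note that pandas stores None as NaN in a numeric column, so the data are unchanged; the warnings are most likely absent because Python, by default, reports a given warning only once per code location: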

>>> df = pd.DataFrame({'x': [0.6, 0.5, 0.13], 'mu': [None, 1, 1], 'std': [None, 2, None]})
>>> normal(df['x'], df['mu'], df['std'])
0         NaN
1    0.193334
2         NaN
dtype: float64
>>> norm.pdf(df['x'], df['mu'], df['std'])
array([        nan,  0.19333406,         nan])
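
If the goal is simply to silence the warnings, one option is NumPy's np.errstate context manager. The sketch below is an illustration under assumptions, not something tested in the original session: it first checks that pandas has stored None as NaN, then evaluates the pdf with the "invalid value" warnings ignored. Depending on your SciPy version, the warnings may not appear at all, in which case the context manager is harmless.

```python
import numpy as np
import pandas as pd
from scipy.stats import norm

df = pd.DataFrame({'x': [0.6, 0.5, 0.13],
                   'mu': [None, 1, 1],
                   'std': [None, 2, None]})

# pandas converts None to NaN when the column is otherwise numeric
none_became_nan = bool(df['mu'].isna().iloc[0])

# Ignore "invalid value encountered in ..." warnings only inside this block
with np.errstate(invalid='ignore'):
    density = norm.pdf(df['x'], df['mu'], df['std'])

print(none_became_nan)
print(density)  # rows with a missing mu or std are still NaN
```

This only hides the messages; the rows with missing parameters still evaluate to NaN.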

I would either remove the rows where your meanreturn and sdreturn values are NaN or, if you are willing to assume a standard normal distribution for those rows, set the NaN values of meanreturn to 0 and the NaN values of sdreturn to 1.
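
Both options can be sketched with pandas as follows. The column names return, meanreturn and sdreturn come from the question; the small DataFrame and the names complete and filled are made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import norm

# Toy data standing in for the real DataFrame (values are illustrative)
df = pd.DataFrame({
    "return": [0.6, 0.5, 0.13],
    "meanreturn": [np.nan, 1.0, 1.0],
    "sdreturn": [np.nan, 2.0, 1.0],
})

# Option 1: drop the rows whose mean or standard deviation is missing
complete = df.dropna(subset=["meanreturn", "sdreturn"]).copy()
complete["normprob"] = norm.pdf(
    complete["return"], complete["meanreturn"], complete["sdreturn"]
)

# Option 2: assume a standard normal (mean 0, sd 1) where values are missing
filled = df.fillna({"meanreturn": 0.0, "sdreturn": 1.0})
filled["normprob"] = norm.pdf(
    filled["return"], filled["meanreturn"], filled["sdreturn"]
)

print(complete)
print(filled)
```

Option 1 leaves you with fewer rows; option 2 keeps every row but bakes in the standard-normal assumption, so it is only appropriate if that assumption makes sense for your data.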

One last comment: if every row of your dataframe assumes a standard normal distribution, you don't need to pass the mu and std columns at all. norm.pdf defaults to loc=0 and scale=1, i.e. a standard normal, so you can just run:

>>> norm.pdf(df['x'])
array([ 0.3332246 ,  0.35206533,  0.39558542])
Scratch'N'Purr