In SciPy, one can evaluate the CDF of a beta distribution as follows:

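import scipy.stats

x = 640495496
alpha = 1.5017096
beta = 628.110247
A = 0             # lower bound of the support (loc)
B = 148000000000  # upper bound of the support (loc + scale)
p = scipy.stats.beta.cdf(x, alpha, beta, loc=A, scale=B-A)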

Now, suppose I have a pandas DataFrame with the columns x, alpha, beta, A, B. How do I evaluate the beta CDF for each row, appending the result as a new column?

Cameron

2 Answers

Use DataFrame.apply with the function scipy.stats.beta.cdf and axis=1:

df['p'] = df.apply(lambda x:  scipy.stats.beta.cdf(x['x'], 
                                                   x['alpha'], 
                                                   x['beta'], 
                                                   loc=x['A'], 
                                                   scale=x['B']-x['A']), axis=1)

Sample:

import pandas as pd
import scipy.stats

df = pd.DataFrame({'x':[640495496, 640495440],
                   'alpha':[1.5017096,1.5017045],
                   'beta':[628.110247, 620.110],
                   'A':[0,0],
                   'B':[148000000000,148000000000]})
print (df)
   A             B     alpha        beta          x
0  0  148000000000  1.501710  628.110247  640495496
1  0  148000000000  1.501704  620.110000  640495440

df['p'] = df.apply(lambda x:  scipy.stats.beta.cdf(x['x'], 
                                                   x['alpha'], 
                                                   x['beta'], 
                                                   loc=x['A'], 
                                                   scale=x['B']-x['A']), axis=1)
print (df)
   A             B     alpha        beta          x         p
0  0  148000000000  1.501710  628.110247  640495496  0.858060
1  0  148000000000  1.501704  620.110000  640495440  0.853758
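On the meaning of the arguments: `loc` and `scale` shift and stretch the standard beta distribution from [0, 1] onto [A, B], so the call above should give the same result as evaluating the standard beta CDF at (x - A) / (B - A). A minimal sketch of that equivalence, reusing the values from the question (the names `p_scaled` and `p_standard` are only illustrative):

```python
import numpy as np
from scipy import stats

# Values taken from the question
x, a, b = 640495496, 1.5017096, 628.110247
A, B = 0, 148000000000

# Beta CDF on the interval [A, B], expressed through loc and scale
p_scaled = stats.beta.cdf(x, a, b, loc=A, scale=B - A)

# Same quantity: rescale x to [0, 1] by hand and use the standard beta CDF
p_standard = stats.beta.cdf((x - A) / (B - A), a, b)

print(p_scaled, p_standard)
```

Both printed values should agree up to floating-point error.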
jezrael
  • I've imported scipy, but it is returning an error when I use apply: NameError: ("global name 'scipy' is not defined", u'occurred at index 0') – Cameron May 22 '17 at 12:30
  • Using *only* `import scipy` will not import `scipy.stats`. To use `scipy.stats`, you must use `import scipy.stats`. – Warren Weckesser May 22 '17 at 16:06
  • Yes, to clarify, I am using `import scipy.stats`, but it still doesn't seem to be working. The answer below, however, does work. – Cameron May 23 '17 at 08:51
  • Hmm, I tested it in Python 3 in Spyder and it works for me. But maybe I was wrong. – jezrael May 23 '17 at 08:55
  • @Cameron - does Warren Weckesser's solution (`import scipy.stats`) not work for you? – jezrael May 23 '17 at 09:16
  • @jezrael Can you please explain what x, A, and B are? Are they columns of a DataFrame, or are they derived from raw data? Also, do we have to compute the values of 'alpha' and 'beta' for each record before applying the beta function? – User1011 Mar 14 '22 at 12:39

I suspect that pandas apply just loops over all rows, and the scipy.stats distributions have quite a bit of overhead in each call, so I would use a vectorized version:

>>> from scipy import stats
>>> df['p'] = stats.beta.cdf(df['x'], df['alpha'], df['beta'], loc=df['A'], scale=df['B']-df['A'])
>>> df
   A             B     alpha        beta          x         p
0  0  148000000000  1.501710  628.110247  640495496  0.858060
1  0  148000000000  1.501704  620.110000  640495440  0.853758
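To convince yourself that the vectorized call and the row-wise apply give the same numbers, you can compute both and compare them. This is only a sketch on the two sample rows from the other answer; the names `p_rowwise` and `p_vectorized` are illustrative, and the columns are converted with `to_numpy()` just to make explicit that plain arrays are being passed:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Sample data from the other answer
df = pd.DataFrame({'x': [640495496, 640495440],
                   'alpha': [1.5017096, 1.5017045],
                   'beta': [628.110247, 620.110],
                   'A': [0, 0],
                   'B': [148000000000, 148000000000]})

# Row-wise: one scipy call per row
p_rowwise = df.apply(lambda row: stats.beta.cdf(row['x'], row['alpha'], row['beta'],
                                                loc=row['A'],
                                                scale=row['B'] - row['A']),
                     axis=1)

# Vectorized: a single scipy call on whole columns
p_vectorized = stats.beta.cdf(df['x'].to_numpy(),
                              df['alpha'].to_numpy(),
                              df['beta'].to_numpy(),
                              loc=df['A'].to_numpy(),
                              scale=(df['B'] - df['A']).to_numpy())

# Compare the two results element by element
print(np.allclose(p_rowwise, p_vectorized))
```

On a larger DataFrame you could time both variants (for example with `%timeit` in IPython) to see how much the per-call overhead matters for your data.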
Josef