I have this data set:
ItemNumber Successes Trials Prob
15 14 95 0.047
9625 20 135 0.047
19 14 147 0.047
24 12 120 0.047
20 15 133 0.047
22 8 91 0.047
9619 16 131 0.047
10006 8 132 0.047
25 15 127 0.047
I want to calculate a cumulative binomial p-value for each item, to understand the probability of observing an equal or higher number of item occurrences.
I used this code:
import sys
from scipy.stats.distributions import binom

# Each input line: item number, successes, trials, probability (whitespace-separated)
for line in open(sys.argv[1], 'r').readlines():
    line = line.strip().split()
    Item, num_succ, num_trials, prob = line[0], int(line[1]), int(line[2]), float(line[3])
    print(Item + "\t" + str(num_succ) + "\t" + str(num_trials) + "\t" + str(prob) + "\t" + str(1 - (binom.cdf(num_succ, num_trials, prob))))
The output looks like this:
Item NumSucc NumTrials Prob Binomial
15 14 95 0.047 3.73e-05
9625 20 135 0.047 1.48e-06
19 14 147 0.047 0.004
24 12 120 0.047 0.0043
20 15 133 0.047 0.00054
22 8 91 0.047 0.027
9619 16 131 0.047 0.0001
10006 8 132 0.047 0.169
25 15 127 0.047 0.0003
The problem: When I pick a line and check the obtained cumulative binomial p value against an online tool like this: http://stattrek.com/online-calculator/binomial.aspx , the results are not the same.
For example,
For item 20 (# successes = 15, # trials = 133, Prob = 0.047):
My Binomial P Val = 0.00054
StatTrek P Val = 0.0015
However, I can see from StatTrek that the value I have been calculating corresponds to the cumulative probability P(X > 15), but since I want "equal to or greater than", what I actually want to calculate is P(X >= 15) (which is 0.0015).
I'm struggling to edit the above code so that the p-value it returns changes from "probability of more than this number of occurrences" to "probability of this number of occurrences or more". If someone could demonstrate, I would appreciate it. If you look at this question, I was trying to follow Volodymyr's comment.
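To make the distinction concrete, here is a minimal sketch of the two tail probabilities, assuming scipy's binom.sf (the survival function, which gives P(X > k)) is acceptable in place of 1 - cdf. The helper names upper_tail_strict and upper_tail_inclusive are made up for illustration and are not part of scipy. Since the counts are integers, P(X >= k) is the same as P(X > k - 1), so shifting the argument down by one should give the inclusive tail.

```python
from scipy.stats import binom


def upper_tail_strict(num_succ, num_trials, prob):
    """P(X > num_succ): the same quantity as 1 - binom.cdf(num_succ, ...)."""
    return binom.sf(num_succ, num_trials, prob)


def upper_tail_inclusive(num_succ, num_trials, prob):
    """P(X >= num_succ), written as P(X > num_succ - 1) for integer counts."""
    # Hypothetical helper: shifting k down by one includes k itself in the tail.
    return binom.sf(num_succ - 1, num_trials, prob)


# Item 20 from the data above: 15 successes, 133 trials, prob = 0.047
print(upper_tail_strict(15, 133, 0.047))     # P(X > 15)
print(upper_tail_inclusive(15, 133, 0.047))  # P(X >= 15)
```

For item 20, the first value should correspond to the 0.00054 my code printed and the second to the 0.0015 reported by StatTrek. The two differ by exactly P(X = 15), which is binom.pmf(15, 133, 0.047).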