0

I have this data set:

ItemNumber  Successes   Trials    Prob
15          14           95       0.047
9625        20           135      0.047
19          14           147      0.047
24          12           120      0.047
20          15           133      0.047
22          8            91       0.047
9619        16           131      0.047
10006       8            132      0.047
25          15           127      0.047

I want to identify a culmulative binomial distribution p value for each item, to understand the probability of observing an equal or higher number of item occurrences.

I used this code:

import sys
import scipy
from scipy.stats.distributions import binom
import sys

for line in open(sys.argv[1], 'r').readlines():
    line = line.strip().split()
    Item,num_succ,num_trials,prob = line[0],int(line[1]),int(line[2]),float(line[3])
    print Item + "\t" + str(num_succ) + "\t" + str(num_trials) + "\t" + str(prob) + "\t" + str(1 - (binom.cdf(num_succ, num_trials, prob)))

The output looks like this:

Item    NumSucc NumTrials   Prob    Binomial
15      14      95         0.047    3.73e-05
9625    20      135        0.047    1.48e-06
19      14      147        0.047    0.004
24      12      120        0.047    0.0043
20      15      133        0.047    0.00054
22      8       91         0.047    0.027
9619    16      131        0.047    0.0001
10006   8       132        0.047    0.169
25      15      127        0.047    0.0003

The problem: When I pick a line and check the obtained cumulative binomial p value against an online tool like this: http://stattrek.com/online-calculator/binomial.aspx , the results are not the same.

For example,

For item 20 (# successes = 15, # trials = 133, Prob = 0.047):

My Binomial P Val = 0.00054
StatTrek P Val = 0.0015

However, I can see from StatTrek, that what I have looked up is the Culmulative Probability: P(X>15), but since I want "equal to or greater than", what I actually want to calculate is the P(X>=15) (which is 0.0015).

I'm struggling to correctly edit the above code, to change the P value returned from "find number of incidences greater than" to "find the number of incidences greater than or equal to". If someone could demonstrate I would appreciate it. If you look at this question, I was trying to follow Volodymyr's comment.

Community
  • 1
  • 1
TomRyan
  • 11
  • 5

2 Answers2

0

The binominal distribution is a discrete distribution. Therefore the following is true P(X>14) = P(X>=15).

So if binom.cdf calculates a probability for P(X > N) (does it? i did not found the documentation for it) you have to change it to P(X > N - 1) if you want to test for P(X >= N).

Hatatister
  • 962
  • 6
  • 11
0

If you want to calculate the p_value for each record use this code which is way easier:

#alternative : {‘two-sided’, ‘greater’, ‘less’},
from scipy.stats import binom_test
binom_test(x= number_of_occurance, n = number_of_trail, p= probability, alternative= 'greater')
Afshin Amiri
  • 3,438
  • 1
  • 20
  • 21