Overlapping probability of two normal distribution with scipy

Question

i have two scipy.stats.norm(mean, std).pdf(0) normal distribution curve and i am trying to find out the differential (overlapping) of the two curves.

How do i calculate it with scipy in Python? Thanks

Have you seen this: http://stackoverflow.com/questions/22579434/python-finding-the-intersection-point-of-two-gaussian-curves — duhaime, Sep 13 '15 at 16:47
yes, but thats just to find intersection points right? i trying to find the whole area like those i see about Overlapping coefficients (OVL) — desmond, Sep 13 '15 at 17:57
This question is similar but not restricted to Gaussians: https://stackoverflow.com/questions/20381672/calculate-overlap-area-of-two-functions — Gabriel, Sep 04 '17 at 15:14

Ed Smith · Accepted Answer · 2022-01-21T21:55:11.660

You can use the answer. suggested by @duhalme to get the intersect and then use this point to define the range of integral limits,

Where the code for this looks like,

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
norm.cdf(1.96)

def solve(m1,m2,std1,std2):
  a = 1/(2*std1**2) - 1/(2*std2**2)
  b = m2/(std2**2) - m1/(std1**2)
  c = m1**2 /(2*std1**2) - m2**2 / (2*std2**2) - np.log(std2/std1)
  return np.roots([a,b,c])

m1 = 2.5
std1 = 1.0
m2 = 5.0
std2 = 1.0

#Get point of intersect
result = solve(m1,m2,std1,std2)

#Get point on surface
x = np.linspace(-5,9,10000)
plot1=plt.plot(x,norm.pdf(x,m1,std1))
plot2=plt.plot(x,norm.pdf(x,m2,std2))
plot3=plt.plot(result,norm.pdf(result,m1,std1),'o')

#Plots integrated area
r = result[0]
olap = plt.fill_between(x[x>r], 0, norm.pdf(x[x>r],m1,std1),alpha=0.3)
olap = plt.fill_between(x[x<r], 0, norm.pdf(x[x<r],m2,std2),alpha=0.3)

# integrate
area = norm.cdf(r,m2,std2) + (1.-norm.cdf(r,m1,std1))
print("Area under curves ", area)

plt.show()

The cdf is used to obtain the integral of the Gaussian here, although symbolic version of the Gaussian could be defined and scipy.quad employed (or something else). Alternatively, you could use a Monte Carlo method like this link (i.e. generate random numbers and reject any outside the range you want).

For some reason, I get area higher than 1? I thought 1 means completely overlapping. My m1, std1, m2, std2 = -3.1322112, 4.258081, -3.8036838, 4.7481666, — Shawn, Jan 20 '22 at 18:40

score 12 · Answer 2 · edited Nov 10 '19 at 19:15

12

Starting Python 3.8, the standard library provides the NormalDist object as part of the statistics module.

NormalDist can be used to compute the overlapping coefficient (OVL) between two normal distributions via the NormalDist.overlap(other) method which returns a value between 0.0 and 1.0 giving the overlapping area for two probability density functions:

from statistics import NormalDist

NormalDist(mu=2.5, sigma=1).overlap(NormalDist(mu=5.0, sigma=1))
# 0.2112995473337106

edited Nov 10 '19 at 19:15

cvanelteren

1,633
9
16

answered Mar 09 '19 at 19:14

Xavier Guihot

54,987
21
291
190

1

The potentially very useful statistics module of python 3.8 can be imported in 3.6: `url='https://raw.githubusercontent.com/python/cpython/3.8/Lib/statistics.py` `import urllib.request` `urllib.request.urlretrieve(url, basename(url))` `from statistics import *`. I bet there would be an elegant way to do this. – Nov 28 '20 at 00:35

amitdatta · Answer 3 · 2017-11-16T23:15:04.767

Ed's answer is great. However, I noticed that it does not work when there are two or infinite (completely overlapping) points of contact. Here is version of the code that handles these two cases as well.

If you also want to continue seeing the plots of the distributions, you can use Ed's code.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

def solve(m1,m2,std1,std2):
    a = 1./(2.*std1**2) - 1./(2.*std2**2)
    b = m2/(std2**2) - m1/(std1**2)
    c = m1**2 /(2*std1**2) - m2**2 / (2*std2**2) - np.log(std2/std1)
    return np.roots([a,b,c])

m1 = 2.5
std1 = 1.0
m2 = 5.0
std2 = 1.0

result = solve(m1,m2,std1,std2)
# 'lower' and 'upper' represent the lower and upper bounds of the space within which we are computing the overlap
if(len(result)==0): # Completely non-overlapping 
    overlap = 0.0

elif(len(result)==1): # One point of contact
    r = result[0]
    if(m1>m2):
        tm,ts=m2,std2
        m2,std2=m1,std1
        m1,std1=tm,ts
    if(r<lower): # point of contact is less than the lower boundary. order: r-l-u
        overlap = (norm.cdf(upper,m1,std1)-norm.cdf(lower,m1,std1))
    elif(r<upper): # point of contact is more than the upper boundary. order: l-u-r
        overlap = (norm.cdf(r,m2,std2)-norm.cdf(lower,m2,std2))+(norm.cdf(upper,m1,std1)-norm.cdf(r,m1,std1))
    else: # point of contact is within the upper and lower boundaries. order: l-r-u
        overlap = (norm.cdf(upper,m2,std2)-norm.cdf(lower,m2,std2))

elif(len(result)==2): # Two points of contact
    r1 = result[0]
    r2 = result[1]
    if(r1>r2):
        temp=r2
        r2=r1
        r1=temp
    if(std1>std2):
        tm,ts=m2,std2
        m2,std2=m1,std1
        m1,std1=tm,ts
    if(r1<lower):
        if(r2<lower):           # order: r1-r2-l-u
            overlap = (norm.cdf(upper,m1,std1)-norm.cdf(lower,m1,std1))
        elif(r2<upper):         # order: r1-l-r2-u
            overlap = (norm.cdf(r2,m2,std2)-norm.cdf(lower,m2,std2))+(norm.cdf(upper,m1,std1)-norm.cdf(r2,m1,std1))
        else:                   # order: r1-l-u-r2
            overlap = (norm.cdf(upper,m2,std2)-norm.cdf(lower,m2,std2))
    elif(r1<upper): 
        if(r2<upper):         # order: l-r1-r2-u
            print norm.cdf(r1,m1,std1), "-", norm.cdf(lower,m1,std1), "+", norm.cdf(r2,m2,std2), "-", norm.cdf(r1,m2,std2), "+", norm.cdf(upper,m1,std1), "-", norm.cdf(r2,m1,std1)
            overlap = (norm.cdf(r1,m1,std1)-norm.cdf(lower,m1,std1))+(norm.cdf(r2,m2,std2)-norm.cdf(r1,m2,std2))+(norm.cdf(upper,m1,std1)-norm.cdf(r2,m1,std1))
        else:                   # order: l-r1-u-r2
            overlap = (norm.cdf(r1,m1,std1)-norm.cdf(lower,m1,std1))+(norm.cdf(upper,m2,std2)-norm.cdf(r1,m2,std2))
    else:                       # l-u-r1-r2
        overlap = (norm.cdf(upper,m1,std1)-norm.cdf(lower,m1,std1))

What is `upper` and `lower`? Could you kindly explain the many different cases? — reschu, Jun 13 '16 at 10:58
upper and lower in this case would be the global_upper = max(distribution1, distribution2) and global_min = min(distribution1, distribution2). And if the idea is compute the relative overlap keeping distribution1 as the baseline then global_min = min(distribution1) and global_max = max(distribution2) — Pramit, Sep 07 '16 at 19:56
@FábioDias the comment was actually incorrect. when there are no solutions, it is actually 'completely non-overlapping'. so, overlap=0.0 in that case. — amitdatta, Nov 16 '17 at 23:17
What happens when m1==m2 and std1==std2? (I mean, it is easy enough to catch, I'm wondering what happens as is) — Fábio Dias, Nov 17 '17 at 22:11
The only way of not finding solutions in the overlap of two normal distributions is that they are exactly the same, as @FábioDias commented. If they are the same, the overlap should be 1. Additionally, the only way to have only one point of contact is when the sigmas are similar (but the mus are different). At least this happens analytically, numerically there may be differences although I cannot think of any now. — nudomarinero, Feb 26 '20 at 17:47
@Pramit You said, `global_upper = max(distribution1, distribution2)` and `global_min = min(distribution1, distribution2)` , can you accommodate this is the code above? I still can't understand where does this fits? — Milind Dalvi, Aug 10 '20 at 06:28

Overlapping probability of two normal distribution with scipy

3 Answers3

Linked