1

I am trying to find a way to calculate the overlapping (intersection) and non-overlapping areas of two distributions. Based on some posts (like this: FINDING AREA) I could figure out how to calculate the intersection of the two plots. But I had no luck finding the non-overlapping part.
Given this example distributions:

import pandas as pd
import numpy as np
import seaborn as  sns

data1=np.random.normal(0, 0.1, 100)
data2=np.random.normal(0, 0.3, 100)

x0=pd.Series(data1)
x1=pd.Series(data2)
kwargs = dict(hist_kws={'alpha':.01}, kde_kws={'linewidth':2})
sns.distplot(x0,bins=3, color="dodgerblue", **kwargs)
sns.distplot(x1,bins=3, color="dodgerblue", **kwargs)

I wonder how I can calculate the areas between the two distributions at the edges?
I should mention that I have the data as well (if that is something that I could do with the data itself). enter image description here

Zephyr
  • 11,891
  • 53
  • 45
  • 80
Seji
  • 371
  • 1
  • 10
  • 3
    My inital thought is to find the intersection points of the two distributions then break it down to a couple of pieces. Say the smaller peaked distribution is A and the larger peak B. For the purple part, integrate A - B from start of A to left intersection. For the red part, right intersection of A and B, then integrate A - B until the end of A. Data would really help though, and would help distract me from my acutal work :) – Matt Aug 25 '21 at 18:05
  • Yes, please provide the data – mozway Aug 25 '21 at 18:17
  • The area of a kde is 1. So the non-overlapping part is 1 minus the overlapping part. In general there won't be a nice division into 4 separate regions. – JohanC Aug 25 '21 at 18:20

1 Answers1

2

Taking inspiration from the answer you linked, from data you should compute kde curves through scipy.stats.gaussian_kde:

data1 = np.random.normal(0, 0.1, 100)
data2 = np.random.normal(0, 0.3, 100)

kde1 = gaussian_kde(data1, bw_method = 0.5)
kde2 = gaussian_kde(data2, bw_method = 0.5)

xmin = min(data1.min(), data2.min())
xmax = max(data1.max(), data2.max())
dx = 0.2*(xmax - xmin)
xmin -= dx
xmax += dx

x = np.linspace(xmin, xmax, 1000)
kde1_x = kde1(x)
kde2_x = kde2(x)

Then you need to find intersection points between the kde curves:

idx = np.argwhere(np.diff(np.sign(kde1_x - kde2_x))).flatten()

where idx is the list of index of intersection points.
Finally, you can compute area through numpy.trapz, using index previously computed to slice x, kde1_x and kde2_x:

area1 = np.trapz(kde2_x[:idx[0]] - kde1_x[:idx[0]], x[:idx[0]]) # area under the lower kde, from the first leftmost point to the first intersection point
area2 = np.trapz(kde2_x[idx[1]:] - kde1_x[idx[1]:], x[idx[1]:]) # area under the lower kde, from the second intersection point to the last rightmost point
area3 = np.trapz(np.minimum(kde1_x, kde2_x), x) # intersection area between of the kde curves, between the two intersection points
area4 = np.trapz(kde1_x[idx[0]:idx[1]] - kde2_x[idx[0]:idx[1]], x[idx[0]:idx[1]]) # area under the highest kde, excluding the lower one, between the two intersection points

Complete Code

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde


data1 = np.random.normal(0, 0.1, 100)
data2 = np.random.normal(0, 0.3, 100)

kde1 = gaussian_kde(data1, bw_method = 0.5)
kde2 = gaussian_kde(data2, bw_method = 0.5)

xmin = min(data1.min(), data2.min())
xmax = max(data1.max(), data2.max())
dx = 0.2*(xmax - xmin)
xmin -= dx
xmax += dx

x = np.linspace(xmin, xmax, 1000)
kde1_x = kde1(x)
kde2_x = kde2(x)


idx = np.argwhere(np.diff(np.sign(kde1_x - kde2_x))).flatten()

area1 = np.trapz(kde2_x[:idx[0]] - kde1_x[:idx[0]], x[:idx[0]]) # area under the lower kde, from the first leftmost point to the first intersection point
area2 = np.trapz(kde2_x[idx[1]:] - kde1_x[idx[1]:], x[idx[1]:]) # area under the lower kde, from the second intersection point to the last rightmost point
area3 = np.trapz(np.minimum(kde1_x, kde2_x), x) # intersection area between of the kde curves, between the two intersection points
area4 = np.trapz(kde1_x[idx[0]:idx[1]] - kde2_x[idx[0]:idx[1]], x[idx[0]:idx[1]]) # area under the highest kde, excluding the lower one, between the two intersection points


fig, ax = plt.subplots()

ax.plot(x, kde1_x, color = 'dodgerblue', linewidth = 2)
ax.plot(x, kde2_x, color = 'orangered', linewidth = 2)

ax.fill_between(x[:idx[0]], kde2_x[:idx[0]], kde1_x[:idx[0]], color = 'dodgerblue', alpha = 0.3, label = 'area1')
ax.fill_between(x[idx[1]:], kde2_x[idx[1]:], kde1_x[idx[1]:], color = 'orangered', alpha = 0.3, label = 'area2')
ax.fill_between(x, np.minimum(kde1_x, kde2_x), 0, color = 'lime', alpha = 0.3, label = 'area3')
ax.fill_between(x[idx[0]:idx[1]], kde1_x[idx[0]:idx[1]], kde2_x[idx[0]:idx[1]], color = 'gold', alpha = 0.3, label = 'area4')

ax.plot(x[idx], kde2_x[idx], 'ko')

handles, labels = ax.get_legend_handles_labels()
labels[0] += f': {area1 * 100:.1f}%'
labels[1] += f': {area2 * 100:.1f}%'
labels[2] += f': {area3 * 100:.1f}%'
labels[3] += f': {area4 * 100:.1f}%'
ax.legend(handles, labels)

plt.show()

enter image description here

Zephyr
  • 11,891
  • 53
  • 45
  • 80
  • 2
    Nice. Some remarks: I think the areas need to go up until the intersection, otherwise you have a little gap in between (visible when zooming). So e.g. `area1 = np.trapz(kde2_x[:idx[0]+1] - kde1_x[:idx[0]+1], x[:idx[0]+1])`. Apart from that, the `np.diff` provokes a shift of one position (I think the real intersection is between `idx[0]+1` and `idx[0]+2`). See also [How to find the exact intersection of a curve?](https://stackoverflow.com/questions/46909373/how-to-find-the-exact-intersection-of-a-curve-as-np-array-with-y-0). – JohanC Aug 25 '21 at 21:28
  • 2
    My idea for `area1` would have been `np.trapz(kde2_x[x<0] - np.minimum(kde1_x, kde2_x)[x<0], x[x<0])`, in the supposition that we know the blue area is fully left of zero. Similar for red. Yellow is just kde1 minus green. Of course, either approach supposes that there are exactly 2 intersection points. – JohanC Aug 25 '21 at 21:38
  • Nice …. But note that the `gaussian_kde` object has some methods on it to do integration, which will be easier and more accurate. – mwaskom Aug 26 '21 at 10:36