
I have two sets of data:

Dataset 1:

-4.96600134256044 
-4.78340374913002
-4.93136896680689
-4.80958108060998
-4.78688287192542
-4.9431452930913
-4.93676628405869
-4.87328189586985
-4.91867843591513
-4.72101863119006
-4.95749167305945
-4.79202404641664
-4.91265785779198
-4.94596580589554
-4.96595222256787
-4.7990191635208
-4.97194852291884
-4.78515347272161
-4.78340374913002
-4.8994168374135
-4.97206198058066
-4.95252689510477
-4.93963055552644
-4.95490836707013
-4.94133564424905
-4.78567577865158
-4.93963055552644
-4.93131563559386
-4.9710618452962
-4.90015209439797
-4.9665194453887
-4.93403567225855
-4.91165041153205
-4.85009602823937
-4.78340374913002
-4.77292978439906
-4.94782851444531
-4.64848347534667
-4.91165041153205
-4.82937702765807
-4.96202809430577
-4.7983814963622
-4.93198889539142
-4.97072129594592
-4.88775205449138
-4.96917754667146
-4.972240408012
-4.96062137229138
-4.84390165131993
-4.93630849353535
-4.92623245728544
-4.91859094033325
-4.89568644535618
-4.87243553740634
-4.76982873302833
-4.8953404941385
-4.94451830002783
-4.88104841757604
-4.80303414573805
-4.88705883246573
-4.96499558513462
-4.56610914869673
-4.96928985131163
-4.80780803677881
-4.9556234540787
-4.84808934356167
-4.72319662154655
-4.9575854510567
-4.96960730728536
-4.9056755790436
-4.94039653820335
-4.53920246550341
-4.97211181130125
-4.86213634700864
-4.96802952189005
-4.9717135485154
-4.82056508210921
-4.96777645971916
-4.94038569046493
-4.95173085290477
-4.83470303172871
-4.91551379314551
-4.93963055552644
-4.97211181086369
-4.807583383435
-4.97216236251657
-4.97232745985347
-4.91551379314551
-4.94522084426514
-4.89719997383376
-4.96071975048121
-4.93464863469402
-4.88775205449138
-4.91638381844513
-4.80256598250479
-4.79828215315771
-4.73688107699373
-4.88114134915641
-4.92310502488463

Dataset 2:

-4.96600134256044
-4.78340374913002
-4.93136896680689
-4.80958108060998
-4.78688287192542
-4.9431452930913
-4.93676628405869
-4.87328189586985
-4.91867843591513
-4.72101863119006
-4.95749167305945
-4.79202404641664
-4.91265785779198
-4.94596580589554
-4.96595222256787
-4.7990191635208
-4.97194852291884
-4.78515347272161
-4.78340374913002
-4.8994168374135
-4.97206198058066
-4.95252689510477
-4.93963055552644
-4.95490836707013
-4.94133564424905
-4.78567577865158
-4.93963055552644
-4.93131563559386
-4.9710618452962
-4.90015209439797
-4.9665194453887
-4.93403567225855
-4.91165041153205
-4.85009602823937
-4.78340374913002
-4.77292978439906
-4.94782851444531
-4.64848347534667
-4.91165041153205
-4.82937702765807
-4.96202809430577
-4.7983814963622
-4.93198889539142
-4.97072129594592
-4.88775205449138
-4.96917754667146
-4.972240408012
-4.96062137229138
-4.84390165131993
-4.93630849353535
-4.92623245728544
-4.91859094033325
-4.89568644535618
-4.87243553740634
-4.76982873302833
-4.8953404941385
-4.94451830002783
-4.88104841757604
-4.80303414573805
-4.88705883246573
-4.96499558513462
-4.56610914869673
-4.96928985131163
-4.80780803677881
-4.9556234540787
-4.84808934356167
-4.72319662154655
-4.9575854510567
-4.96960730728536
-4.9056755790436
-4.94039653820335
-4.53920246550341
-4.97211181130125
-4.86213634700864
-4.96802952189005
-4.9717135485154
-4.82056508210921
-4.96777645971916
-4.94038569046493
-4.95173085290477
-4.83470303172871
-4.91551379314551
-4.93963055552644
-4.97211181086369
-4.807583383435
-4.97216236251657
-4.97232745985347
-4.91551379314551
-4.94522084426514
-4.89719997383376
-4.96071975048121
-4.93464863469402
-4.88775205449138
-4.91638381844513
-4.80256598250479
-4.79828215315771
-4.73688107699373
-4.88114134915641
-4.92310502488463

I'm trying to plot them as histograms and then measure the overlap between the two histograms as a percentage of their total area. I tried the method suggested in this post, but it gave me an answer larger than 1, which I didn't think should be possible.

My code looks like this:

import numpy as np
import matplotlib.pyplot as plt

# common range so both histograms share the same bin edges
rng = min(dataset1.min(), dataset2.min()), max(dataset1.max(), dataset2.max())

# histograms normalized so that the bar heights of each dataset sum to 1
n1, bins1, _ = plt.hist(dataset1, color=color1, alpha=0.75, bins=7, weights=np.ones_like(dataset1) / len(dataset1), range=rng)
n1_area = sum(np.diff(bins1) * n1)
n2, bins2, _ = plt.hist(dataset2, color=color2, alpha=0.75, bins=7, weights=np.ones_like(dataset2) / len(dataset2), range=rng)
n2_area = sum(np.diff(bins2) * n2)

overlap = np.minimum(n1, n2)
overlap_area = overlap.sum()
overlap_percentage = overlap_area / (n1_area + n2_area)

Anyone have any idea why I'm getting a percentage that's over 1, and how to fix it so that I get the correct value?


2 Answers


It seems that you calculate the real "area" of the histogram for n1 and n2 with n1_area = sum(np.diff(bins1) * n1), but the overlap is just a sum of bar heights. The two quantities are not comparable.
Either use counts for everything, i.e. overlap.sum() together with n1.sum() and n2.sum(), or use "areas" for everything, i.e. sum(np.diff(bins1) * n1); don't mix the two.
To be clearer, the percentage should be calculated as overlap / (n1 + n2 - overlap), since the total covered by n1 and n2 together, counting the overlap only once, is (n1 + n2 - overlap).
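For example, a count-only version of the question's calculation could look like the following (a minimal sketch that reuses dataset1, dataset2 and the 7-bin setup from the question; np.histogram is used here only to get the counts without plotting):

import numpy as np

rng = min(dataset1.min(), dataset2.min()), max(dataset1.max(), dataset2.max())

# raw counts on shared bin edges, no weights, so n1, n2 and the overlap are all in the same units
n1, bins = np.histogram(dataset1, bins=7, range=rng)
n2, _ = np.histogram(dataset2, bins=7, range=rng)

overlap = np.minimum(n1, n2)

# the two histograms together cover n1 + n2 - overlap, counting the shared part only once
overlap_percentage = overlap.sum() / (n1.sum() + n2.sum() - overlap.sum())

The same formula works with areas, as long as the overlap is also measured as an area (per-bin minimum times bin width).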

ease_zh
  • Some of the code will be similar to How to plot the difference between two histograms, except density will be used in np.histogram.
    • In order to calculate the overlap, the bin edges of the two histograms must be the same.
  • np.histogram
    • density : bool, optional If False, the result will contain the number of samples in each bin. If True, the result is the value of the probability density function at the bin, normalized such that the integral over the range is 1. Note that the sum of the histogram values will not be equal to 1 unless bins of unity width are chosen; it is not a probability mass function.

    • In this case, the bin widths are 0.5, so h1 and h2 need to be multiplied by 0.5 (a quick check of this normalization is shown after the first plot below).
  • Custom normal samples are used, because the two datasets in the OP are exactly the same.
  • Tested in python 3.11.3, pandas 2.0.2, matplotlib 3.7.1, seaborn 0.12.2, numpy 1.24.3
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# sample datasets
np.random.seed(2023)
dataset1 = np.random.normal(loc=9, scale=1.5, size=100)
dataset2 = np.random.normal(loc=8, scale=0.6, size=100)

# create a long form dataframe for use with seaborn
df = pd.DataFrame({'ds1': dataset1, 'ds2': dataset2}).melt()

# calculate the density hist for each dataset with specified matching bin edges
h1, be1 = np.histogram(dataset1, bins=np.arange(4, 13.1, 0.5), density=True)
h2, be2 = np.histogram(dataset2, bins=np.arange(4, 13.1, 0.5), density=True)
plt.figure(figsize=(12, 4))
ax = sns.histplot(data=df, x='value', stat='density', hue='variable', multiple='dodge', bins=np.arange(4, 13.1, 0.5))
ax.set_xticks(be2)

ax.margins(x=0)

for c in ax.containers:
    _ = ax.bar_label(c, fontsize=8)

[plot: dodged density histograms of the two datasets, with bar labels]
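As a quick check of the density normalization quoted above, each histogram should integrate to 1 over its bins (a small sketch reusing h1, be1, h2, and be2 from the code above):

# each density integrates to 1 over the bins; the bin width is 0.5 everywhere,
# so this is the same as (h1 * 0.5).sum()
print((h1 * np.diff(be1)).sum())  # 1.0
print((h2 * np.diff(be2)).sum())  # 1.0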

  • Calculate the overlap percentage by creating a logical_and mask from the Booleans marking where h1 and h2 are each non-zero.
# create a mask for where each data set is non-zero
m1 = h1 != 0
m2 = h2 != 0

# use a logical and to create a combined map where both datasets are non-zero
ol = np.logical_and(m1, m2)

# calculate the overlapping density, where 0.5 is the bin width
ol_density = np.abs((h1 - h2) * 0.5)[ol]

# calculate the total overlap percent
ol_percent = ol_density.sum() * 100

ol_percent → 71.00000000000001
  • Plot the absolute value of the overlapping areas with sns.barplot.
  • Comparing the next plot to the previous plot shows bars from the bins of the overlapping data.
  • The sum of the bar value annotations is equal to ol_percent / 100 (the overlap expressed as a fraction rather than a percent).
# calculate the absolute difference for each bin
y = np.abs(h1 - h2) * 0.5

# set non-overlapping bins to 0
y[~ol] = 0

plt.figure(figsize=(12, 4))
ax = sns.barplot(y=y, x=be1[:-1], width=1, ec='k', color='purple', alpha=0.75)
_ = ax.set_xticks(ticks=np.arange(0, 18, 1)-0.5, labels=be1[:-1])

ax.margins(x=0, y=0.1)

for c in ax.containers:
    _ = ax.bar_label(c, fontsize=8, padding=3)

[plot: bars showing the absolute per-bin difference within the overlapping bins]

Trenton McKinney