
I'm plotting two distributions as histplots, and would like to visualize the difference between them. The distributions are rather similar:

my plots

The code I am using to generate one of these plots looks like this:

sns.histplot(
    data=dfs_downvoted_percentages["only_pro"],
    ax=axes[0],
    x="percentage_downvoted",
    bins=30,
    stat="percent",
)

My supervisor suggested plotting the difference between the normalized distributions, that is, displaying the result of subtracting one plot from the other. The end result should be a plot where some bins go below 0 (where the bins in plot 2 are larger than in plot 1). Thus, similarities between the plots are erased and differences are highlighted.

  1. Does this make sense? The plots are part of a paper which will hopefully be published; I haven't seen such a plot before, but as he explained it, it makes sense to me. Are there better ways to visualize what I want to express? I already have another plot where I filter out all values with x=0, so that the other ones become more visible.
  2. Is there an easy way to achieve this utilizing seaborn?

If not: I know how I can normalize the data and calculate percentage for each bin by hand. But what I couldn't find is a kind of plot that consists of bins and offers the possibility to have negative bins. I know how I could create a lineplot with 30 data points showing the calculated difference, but I'd rather have it visually similar to the original plots with bins instead of a line. What kind of plot could I use for that?

Trenton McKinney
schadenfreude

1 Answer

  • Use np.histogram, which returns hist and bin_edges.
    • The same bin_edges must be used for both function calls.
    • Subtract one hist from the other to get h_diff, and plot it against bin_edges.
  • Plot h_diff as a bar plot.
    • There is one more bin edge than there are bars, so select all but the last value, bin_edges[:-1], for the x-axis labels passed to x=.
    • The x-ticks of a sns.barplot are 0-indexed, so reset the ticks with one extra tick, offset them by -0.5, and relabel the ticks with all the bin_edges.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# sample data
np.random.seed(2023)
a = np.random.normal(50, 15, (100,))
b = np.random.normal(30, 8, (100,))

# dataframe from sample distributions
df = pd.DataFrame({'a': a, 'b': b})

# calculate the histogram for each distribution
bin_edges = np.arange(10, 91, 10)

a_hist, _ = np.histogram(df.a, bins=bin_edges) 
b_hist, _ = np.histogram(df.b, bins=bin_edges) 

# calculate the difference
h_diff = a_hist - b_hist

# plot
fig, ax = plt.subplots(figsize=(7, 5))
sns.barplot(x=bin_edges[:-1], y=h_diff, color='tab:blue', ec='k', width=1, alpha=0.8, ax=ax)
ax.set_xticks(ticks=np.arange(0, 9)-0.5, labels=bin_edges)
ax.margins(x=0.1)
_ = ax.set(title='Difference between Sample A and B: hist(a) - hist(b)', ylabel='Difference', xlabel='Bin Ranges')

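The question asks for the difference between normalized distributions, whereas the code above subtracts raw counts (equivalent here only because both samples have 100 values). A minimal sketch of a percent-based variant follows, assuming the same kind of sample data. The helper percent_hist is a hypothetical name introduced for illustration, and computing shared edges with np.histogram_bin_edges on the combined data is one possible choice, not necessarily what the plots in the question use. It draws with matplotlib's ax.bar, which accepts negative heights and real x positions, so no tick offset is needed.

```python
import numpy as np
import matplotlib.pyplot as plt

# sample data (assumption: stands in for the two real samples)
np.random.seed(2023)
a = np.random.normal(50, 15, 100)
b = np.random.normal(30, 8, 100)

# shared bin edges computed from the combined data, so every value lands in a bin
bin_edges = np.histogram_bin_edges(np.concatenate([a, b]), bins=8)


def percent_hist(values, edges):
    """Hypothetical helper: percentage of the binned values that falls in each bin."""
    counts, _ = np.histogram(values, bins=edges)
    return 100 * counts / counts.sum()


# positive where sample a has the larger share, negative where b does
pct_diff = percent_hist(a, bin_edges) - percent_hist(b, bin_edges)

fig, ax = plt.subplots(figsize=(7, 5))
# align='edge' starts each bar at its left bin edge; width spans the whole bin
ax.bar(bin_edges[:-1], pct_diff, width=np.diff(bin_edges), align='edge', ec='k', alpha=0.8)
ax.axhline(0, color='k', lw=1)  # zero reference line
ax.set_xticks(bin_edges, labels=[f'{e:.0f}' for e in bin_edges])
_ = ax.set(title='percent(a) - percent(b)', ylabel='Difference (percentage points)', xlabel='Bin Ranges')
```

Because each normalized histogram sums to 100, the bars of the difference sum to zero, which is a quick sanity check on the result.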

  • An alternative, which I think presents the data better because it still shows the distribution of both data sets, is to plot the histograms together with dodged bars.
fig, ax = plt.subplots(figsize=(7, 5))
sns.histplot(data=df, multiple='dodge', common_bins=True, ax=ax, bins=bin_edges)


Trenton McKinney