I know that, by default, a histogram counts the number of occurrences in each bin. Alternatively, the distribution can be visualized as a density or as a probability:

sns.displot(data, stat = 'density')

or

sns.displot(data, stat = 'probability')

My question is: in which cases should I use stat = 'density', and in which stat = 'probability'?

Mourad BENKDOUR
  • These are parameters for the underlying `sns.histplot`. With `stat='density'` the *area* of all the bars sums to `1`. With `stat='probability'` the *heights* of the bars sum to `1`. A *density* plot will be on a similar scale to a *probability density function*. A density plot is most appropriate for a continuous random variable; a probability plot would be more appropriate for a discrete random variable. – JohanC Sep 27 '22 at 21:30
  • sns.displot and sns.histplot both have a stat parameter, but sns.displot is more generic than sns.histplot – Mourad BENKDOUR Sep 28 '22 at 16:19
  • `sns.displot` with the default `kind='hist'` creates a grid of histograms. When neither the `row=` nor the `col=` parameter is used, it looks and behaves a lot like `sns.histplot`. – JohanC Sep 28 '22 at 21:08

1 Answer

stat = 'density' normalizes the histogram so that it approximates a probability density function (PDF) (Wikipedia).
As JohanC mentioned in the comments, a key aspect of a PDF is that the area under the curve (here: of all bars together) is 1. So each bar's width is taken into account along with its height.

stat = 'probability' creates the same bars (with the same widths), but each height (y-axis value) directly states the probability of that bin, and the heights of all bars sum to 1.
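
The two normalizations can also be checked by hand. The following is a minimal sketch in plain Python, not seaborn's own implementation: the helper name `normalized_heights` and the sample values are made up for illustration, and equal-width bins are assumed.

```python
def normalized_heights(values, bins):
    """Return (bin_width, counts, probability, density) for equal-width bins.

    Hypothetical helper for illustration only; it is not part of seaborn.
    """
    lo, hi = min(values), max(values)
    bin_width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value falls into the last bin (right edge inclusive).
        idx = min(int((v - lo) / bin_width), bins - 1)
        counts[idx] += 1
    n = len(values)
    probability = [c / n for c in counts]            # heights sum to 1
    density = [c / (n * bin_width) for c in counts]  # areas sum to 1
    return bin_width, counts, probability, density


# Made-up sample, not the penguins data.
sample = [178, 181, 185, 190, 193, 195, 197, 201, 210, 231]
bin_width, counts, probability, density = normalized_heights(sample, bins=2)

print("counts:     ", counts)
print("probability:", probability, "-> sum of heights:", sum(probability))
print("density:    ", density, "-> sum of areas:", sum(d * bin_width for d in density))
```

For equal-width bins the two statistics differ only by the constant factor `bin_width`, so the bars have the same shape and only the y-axis scale changes.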


Which one to use depends on what you want to show with your plot and who the audience is.

'probability' is more intuitive and remains understandable for stacked bars as well.
'density' is better suited for an expert audience that is familiar with PDFs.

Also, since PDFs are usually displayed as a continuous curve, 'density' with displot is better suited for a larger number of bins, while 'probability' with displot is intuitive even for e.g. 2 bins.
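
To see why the number of bins matters, here is a rough sketch in plain Python (again not seaborn code). It assumes a synthetic standard-normal sample and a made-up helper `peak_heights`. As the bins get narrower, the tallest 'probability' bar shrinks, while the tallest 'density' bar moves toward the peak of the true PDF.

```python
import math
import random

random.seed(0)  # arbitrary seed, only for reproducibility
sample = [random.gauss(0.0, 1.0) for _ in range(100_000)]  # synthetic data


def peak_heights(values, bins):
    """Return the tallest bar as (probability, density) for equal-width bins.

    Hypothetical helper for illustration only.
    """
    lo, hi = min(values), max(values)
    bin_width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        counts[min(int((v - lo) / bin_width), bins - 1)] += 1
    peak = max(counts)
    n = len(values)
    return peak / n, peak / (n * bin_width)


true_peak = 1 / math.sqrt(2 * math.pi)  # PDF of the standard normal at 0
for bins in (2, 10, 100):
    prob, dens = peak_heights(sample, bins)
    print(f"bins={bins:>3}  peak probability={prob:.4f}  peak density={dens:.4f}")
print(f"true PDF peak: {true_peak:.4f}")
```

The 'probability' heights depend on the chosen bin width, whereas the 'density' heights stay comparable to the PDF regardless of it.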


The seaborn tutorial "Visualizing distributions of data", section "Normalized histogram statistics", provides explanations and example plots.
To illustrate the statements in this answer, reduced example data and plots are used below, along with a different angle of explanation.


Data preparation (the DataFrame handling is kept basic; uncomment the # print lines for an easy cross-check):

import pandas as pd
import seaborn as sns


penguins = sns.load_dataset("penguins")
penguins_strip = penguins[['flipper_length_mm', 'sex']].dropna()
# print(penguins_strip)
print('Female and Male')
print(f'range: {penguins_strip["flipper_length_mm"].max() - penguins_strip["flipper_length_mm"].min()}')
print(f'len: {len(penguins_strip)}')

penguins_strip_male = penguins_strip[penguins_strip['sex'] == 'Male']
# print(penguins_strip_male)
print('Male only')
print(f'range: {penguins_strip_male["flipper_length_mm"].max() - penguins_strip_male["flipper_length_mm"].min()}')
print(f'len: {len(penguins_strip_male)}')

Output:

Female and Male
range: 59.0
len: 333

Male only
range: 53.0
len: 168

A function that displays values on top of the displot bars, heavily based on an answer by Trenton McKinney:

def show_values(plot):
    for ax in plot.axes.ravel():
        # add annotations
        for c in ax.containers:
            # format each bar height with 5 decimals; use an empty string so zero-height bars get no label
            labels = [f'{w:0.5f}' if (w := v.get_height()) > 0 else '' for v in c]
            ax.bar_label(c, labels=labels, label_type='edge', fontsize=8, rotation=0, padding=2)
        ax.margins(y=0.2)

Note: Because only a limited number of float digits is displayed, some of the following calculations are affected by rounding.


2 bins, 'Male' flippers only

Default displot (without stat):

'probability' plot - the y-axis directly shows the probability of each bin, and the bar heights add up to 1.

'density' plot - see area calculations below

0.02156 * (53/2) = 0.57134
0.01617 * (53/2) = 0.428505
# see data preparation above, range is 53, and it's 2 bins

Adding these two areas gives 1 (rounding aside).
You can also try bins_nr = 1 and check the area for that case easily, whereas for 'probability' with bins_nr = 1 the y value is simply 1.
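
The same arithmetic as a short script, for anyone who wants to reproduce it. This sketch only uses the rounded bar labels and the range quoted above; the variable names are chosen for illustration.

```python
value_range = 53.0  # "Male only" range printed in the data preparation
bins_nr = 2
bin_width = value_range / bins_nr

# Rounded bar labels read from the 'density' plot.
density_heights = [0.02156, 0.01617]

# Area of each bar = height * width; for equal-width bins this equals
# the height of the matching bar in the 'probability' plot.
areas = [h * bin_width for h in density_heights]
total_area = sum(areas)
print("areas:", areas)
print("total area:", total_area)  # close to 1, up to label rounding

# With a single bin the one density bar must have area 1 on its own,
# so its height is 1 / range, while the 'probability' height is 1.
single_bin_density = 1 / value_range
print("single-bin density height:", single_bin_density)
print("single-bin area:", single_bin_density * value_range)
```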

Code of the plots

bins_nr = 2

displot_default = sns.displot(penguins_strip_male, x="flipper_length_mm", hue="sex", 
                              bins=bins_nr, multiple="dodge")
show_values(displot_default)
    
displot_density = sns.displot(penguins_strip_male, x="flipper_length_mm", hue="sex", 
                              bins=bins_nr, multiple="dodge", stat = 'density')
show_values(displot_density)
        
displot_probability = sns.displot(penguins_strip_male, x="flipper_length_mm", hue="sex", 
                                  bins=bins_nr, multiple="dodge", stat = 'probability')
show_values(displot_probability)

Stacked plot example (intuitive mainly for 'probability')

displot_probability_stacked = sns.displot(penguins_strip, x="flipper_length_mm", hue="sex", 
                                  bins=bins_nr, multiple="stack", stat = 'probability')
show_values(displot_probability_stacked)

Add-on: If you wonder about the common_norm example from the tutorial, run

displot_density = sns.displot(penguins_strip, x="flipper_length_mm", hue="sex", 
                              bins=bins_nr, multiple="dodge", stat = 'density')
show_values(displot_density)

displot_density_common = sns.displot(penguins_strip, x="flipper_length_mm", hue="sex", bins=bins_nr, 
                multiple="dodge", stat = 'density', common_norm=False)
show_values(displot_density_common)

and calculate the areas.
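
As a hint for what to expect from that area calculation, here is a minimal sketch in plain Python with made-up group data (not the penguins values). It assumes both groups share the same bin edges, which is what seaborn does by default via `common_bins=True`. The helper `bin_counts` is hypothetical.

```python
def bin_counts(values, lo, hi, bins):
    """Count values in equal-width bins spanning [lo, hi] (hypothetical helper)."""
    bin_width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        counts[min(int((v - lo) / bin_width), bins - 1)] += 1
    return counts


# Made-up groups of unequal size.
groups = {
    "Female": [172, 185, 190, 200],
    "Male": [180, 195, 205, 210, 220, 232],
}
all_values = [v for values in groups.values() for v in values]
lo, hi = min(all_values), max(all_values)
bins = 2
bin_width = (hi - lo) / bins
total_n = len(all_values)

area_common = {}    # like common_norm=True: normalize over the full dataset
area_separate = {}  # like common_norm=False: normalize each group on its own
for name, values in groups.items():
    counts = bin_counts(values, lo, hi, bins)
    common = [c / (total_n * bin_width) for c in counts]
    separate = [c / (len(values) * bin_width) for c in counts]
    area_common[name] = sum(h * bin_width for h in common)
    area_separate[name] = sum(h * bin_width for h in separate)

print("common normalization:  ", area_common)
print("separate normalization:", area_separate)
```

With the common normalization each group's area equals its share of all observations, so the areas of all groups together sum to 1. With separate normalization every group's area is 1 on its own.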

MagnusO_O
  • Thanks for your answer, but what do you think about @JohanC's comment: "A density plot is most appropriate for a continuous random variable; a probability plot would be more appropriate for a discrete random variable"? – Mourad BENKDOUR Sep 29 '22 at 13:51
  • @Mourad BENKDOUR See the part of the answer about the number of bins. When the number of bins gets very large, the 'bars' approximate a continuous function. – MagnusO_O Sep 30 '22 at 10:34