I originally used the numpy function .std on my dataframe to obtain the standard deviation and plotted it using matplotlib. Later, I tried making the same graph using seaborn. The two graphs looked close enough until I overlaid them and found that all error bars from seaborn are smaller, with the difference more pronounced the bigger the bars are. I checked in different software that the results from .std are correct and that they are also plotted correctly. What could be the source of the problem? (I can't seem to pull the graph's source data out of seaborn.)

I used this code:

ax_sns = sns.barplot(x='name', y=column_to_plot, data=data, hue='method', capsize=0.1, ci='sd', errwidth=0.9)

[Figure: the graph; the seaborn error bars (the darker ones) are smaller]

Lucie

1 Answer

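You didn't provide the code where you calculated the standard deviation. Perhaps you used pandas' .std(). Seaborn uses numpy's. Pandas' std applies "Bessel's correction" by default (it divides by n - 1 instead of n), while numpy's std divides by n, so the pandas value is always the larger of the two. The difference is most visible when the number of data points is small, because the gap between dividing by n and dividing by n - 1 is then largest.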
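As a minimal sketch of the two conventions, the snippet below computes both standard deviations for a small made-up sample using numpy's ddof parameter. The helper name std_pair and the sample values are hypothetical and only there for illustration; ddof=0 is numpy's default, and ddof=1 matches what pandas' .std() does by default.

```python
import numpy as np


def std_pair(values):
    """Return (population std, sample std) for a 1-D sequence of numbers."""
    arr = np.asarray(values, dtype=float)
    population = np.std(arr)       # ddof=0 (numpy default): divide by n
    sample = np.std(arr, ddof=1)   # ddof=1 (pandas default): divide by n - 1
    return population, sample


# Made-up sample with n = 8, used only to illustrate the gap.
small = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
pop, samp = std_pair(small)

# The two values differ by a factor of sqrt(n / (n - 1)).
ratio = samp / pop
print(pop, samp, ratio)
```

The ratio sqrt(n / (n - 1)) approaches 1 as n grows, which is why the two sets of error bars only diverge visibly for small groups.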

The following code visualizes the difference between error bars calculated via seaborn, numpy and pandas.

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

flights = sns.load_dataset('flights')
fig, ax = plt.subplots(figsize=(12, 5))
sns.barplot(x='month', y='passengers', data=flights, capsize=0.1, ci='sd', errwidth=0.9, fc='yellow', ec='blue', ax=ax)

flights['month'] = flights['month'].cat.codes  # change to a numeric format
for month, data in flights.groupby('month'):
    mean = data['passengers'].mean()
    pandas_std = data['passengers'].std()  # pandas default: ddof=1 (divide by n - 1)
    numpy_std = np.std(data['passengers'])  # numpy default: ddof=0 (divide by n)
    ax.errorbar(month - 0.2, mean, yerr=numpy_std, ecolor='crimson', capsize=8,
                label='numpy std()' if month == 0 else None)
    ax.errorbar(month + 0.2, mean, yerr=pandas_std, ecolor='darkgreen', capsize=8,
                label='pandas std()' if month == 0 else None)
ax.margins(x=0.015)
ax.legend()
plt.tight_layout()
plt.show()

[Figure: sns.barplot with numpy vs pandas error bars]

JohanC
  • That is exactly what I needed! Thank you! From a quick look, it seems that Bessel's correction should probably only be used for large enough sample sizes. Mine is too small for that. While I found a way to turn it off in numpy `np.std(ddof=1)`, is there a way to do it in seaborn too? – Lucie Nov 22 '21 at 11:45
  • No, seaborn only supports numpy's default calculation. The only workaround is calculating and drawing the errorbars with matplotlib on top of the seaborn plot. For large sample sizes, the difference is very small. For small sample sizes, Bessel's correction takes the uncertainty of the calculated mean into account. – JohanC Nov 22 '21 at 12:52
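The workaround described in the last comment could be sketched roughly as follows: compute the per-group mean and standard deviation yourself with an explicit ddof, then draw the error bars on the same Axes with matplotlib. The function names group_mean_std and draw_errorbars, the column names and the demo data are all hypothetical. The sketch also assumes a plot without hue, where the bars sit at x = 0, 1, 2, ... in the same order as the rows of the statistics table.

```python
import pandas as pd


def group_mean_std(df, group_col, value_col, ddof=1):
    """Per-group mean and standard deviation with an explicit ddof."""
    grouped = df.groupby(group_col, sort=True)[value_col]
    return pd.DataFrame({
        "mean": grouped.mean(),
        "std": grouped.std(ddof=ddof),
    })


def draw_errorbars(ax, stats, **errorbar_kwargs):
    """Draw one error bar per row of `stats` at x = 0, 1, 2, ...

    `ax` is an existing matplotlib Axes (for example the one a seaborn
    bar plot was drawn on). Assumes the bars follow the row order of `stats`.
    """
    positions = range(len(stats))
    ax.errorbar(positions, stats["mean"], yerr=stats["std"],
                fmt="none", **errorbar_kwargs)


# Tiny made-up data set, only for illustration.
demo = pd.DataFrame({
    "name": ["a", "a", "a", "b", "b", "b"],
    "value": [1.0, 2.0, 3.0, 2.0, 4.0, 6.0],
})
stats = group_mean_std(demo, "name", "value", ddof=1)
print(stats)
```

With the seaborn version used in the question, one way to use this would be to draw the bars with sns.barplot(x='name', y='value', data=demo, ci=None, order=list(stats.index), ax=ax) so that seaborn draws no error bars of its own, and then call draw_errorbars(ax, stats, ecolor='black', capsize=4). With hue, the bars are shifted away from the integer positions, so the x offsets would have to be worked out separately.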