I am using seaborn.catplot with kind='point' to plot my data. I would like to calculate the standard error of the mean (SEM) for each hue level and each category using the same method as seaborn, so that my computed values exactly match the plotted error bars. The default method for calculating the SEM and the 95% confidence intervals (CIs) uses a bootstrapping algorithm, in which the mean is bootstrapped 1000 times. In an earlier post, I saw an approach that uses functions from the seaborn source code (seaborn.utils.ci() and seaborn.algorithms.bootstrap()), but I am not sure how to implement it. Since bootstrapping relies on random sampling, I would also need to make sure that the same array of 1000 means is produced both for plotting and for obtaining the SEM.
Here is a code example:
import numpy as np
import pandas as pd
import seaborn as sns
# simulate data
rng = np.random.RandomState(42)
measure_names = np.tile(np.repeat(['Train BAC', 'Test BAC'], 10), 2)
model_numbers = np.repeat([0, 1], 20)
measure_values = np.concatenate((rng.uniform(low=0.6, high=1, size=20),
                                 rng.uniform(low=0.5, high=0.8, size=20)))
folds = np.tile([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 4)
plot_df = pd.DataFrame({'model_number': model_numbers,
                        'measure_name': measure_names,
                        'measure_value': measure_values,
                        'outer_fold': folds})
# plot data as pointplot
g = sns.catplot(x='model_number',
                y='measure_value',
                hue='measure_name',
                kind='point',
                seed=rng,
                data=plot_df)
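which produces a point plot of the mean measure_value per model_number, with one line per measure_name and bootstrapped error bars (figure not shown).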
I would like to obtain the SEM for all train and test scores for both models. That is:
# obtain SEM for each score in each model using the same method as in sns.catplot
model_0_train_bac = plot_df.loc[((plot_df['model_number'] == 0) & (plot_df['measure_name'] == 'Train BAC')),'measure_value']
model_0_test_bac = plot_df.loc[((plot_df['model_number'] == 0) & (plot_df['measure_name'] == 'Test BAC')),'measure_value']
model_1_train_bac = plot_df.loc[((plot_df['model_number'] == 1) & (plot_df['measure_name'] == 'Train BAC')),'measure_value']
model_1_test_bac = plot_df.loc[((plot_df['model_number'] == 1) & (plot_df['measure_name'] == 'Test BAC')),'measure_value']
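To make concrete what I am after, here is a minimal NumPy-only sketch of the kind of computation I have in mind. The helper name bootstrap_sem_ci and its parameters are my own invention, not part of seaborn, and I am only assuming that it resembles what seaborn.algorithms.bootstrap() and seaborn.utils.ci() do internally. It also uses stand-in data and its own seed, so it does not reproduce the random draws that seaborn makes while plotting, which is exactly the part I am unsure about.

```python
import numpy as np


def bootstrap_sem_ci(values, n_boot=1000, ci=95, seed=None):
    """Hypothetical helper (not a seaborn function): bootstrap the mean.

    Returns (sem, (ci_low, ci_high)). The SEM is taken as the standard
    deviation of the bootstrapped means, and the CI is a percentile interval.
    `seed` is expected to be an int or None.
    """
    values = np.asarray(values, dtype=float)
    rng = np.random.RandomState(seed)
    n = len(values)
    # one row of resampled indices (with replacement) per bootstrap iteration
    idx = rng.randint(0, n, size=(n_boot, n))
    boot_means = values[idx].mean(axis=1)
    sem = boot_means.std(ddof=1)
    ci_low, ci_high = np.percentile(boot_means, [50 - ci / 2, 50 + ci / 2])
    return sem, (ci_low, ci_high)


# stand-in scores for illustration only, not the plot_df values above
demo_rng = np.random.RandomState(0)
demo_scores = demo_rng.uniform(low=0.6, high=1.0, size=10)

sem, (ci_low, ci_high) = bootstrap_sem_ci(demo_scores, seed=42)
print(f"SEM = {sem:.4f}, 95% CI = ({ci_low:.4f}, {ci_high:.4f})")
```

Applied to my data, the call would look like bootstrap_sem_ci(model_0_train_bac, seed=...), but I do not know which seed or random state would give the same 1000 resamples that catplot used for the error bars.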