
I ran into a problem when trying to plot my dataset with a seaborn boxplot. The data comes pre-aggregated (grouped) from a database and looks like this:

         region   age  total
0              STC   2.0  11024
1              PHA  84.0   3904
2              OLK  55.0  12944
3              VYS  72.0   5592
4              PAK  86.0   2168
...            ...   ...    ...
1460           KVK  62.0   4600
1461           MSK  41.0  26568
1462           LBK  13.0   6928
1463           JHC  18.0   8296
1464           HKK  88.0   2408

I would like to create a box plot with the region on the x-axis and age on the y-axis, weighted by the total number of observations.

When I try ax = sns.boxplot(x='region', y='age', data=df), I get a plain boxplot that doesn't take the total column into account. One brute-force option is to repeat each row as many times as its total, but I don't like this solution.

JohanC
    Welcome to [Stack Overflow](https://stackoverflow.com/ "Stack Overflow"). In order for us to help you, it is necessary that you provide a minimal reproducible problem set consisting of sample input, expected output, actual output, and all relevant code necessary to reproduce the example. What you have provided falls short of this goal. Please edit your question to show a minimal reproducible set. See [Minimal Reproducible Example](https://stackoverflow.com/help/minimal-reproducible-example "Minimal Reproducible Example") for details. – itprorh66 Nov 21 '21 at 15:24

1 Answer


sns.histplot and sns.kdeplot support a weights= parameter, but sns.boxplot doesn't. Simply repeating values isn't necessarily a bad solution, but in this case the totals are very large. You could create a new dataframe with repeated data, but divide the 'total' column by a constant first to keep the number of rows manageable.

In the sample data every row has a different region, which would leave a single observation per box and make a boxplot rather meaningless. The code below assumes there aren't too many regions (1400 regions certainly wouldn't work well).

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from io import StringIO

df_str = ''' region   age  total
                STC   2.0  11024
                STC  84.0   3904
                STC  55.0  12944
                STC  72.0   5592
                STC  86.0   2168
                PHA  62.0   4600
                PHA  41.0  26568
                PHA  13.0   6928
                PHA  18.0   8296
                PHA  88.0   2408'''
df = pd.read_csv(StringIO(df_str), sep=r'\s+')
# use a scaled down version of the totals as a repeat factor
repeats = df['total'].to_numpy(dtype=int) // 100
df_total = pd.DataFrame({'region': np.repeat(df['region'].values, repeats),
                         'age': np.repeat(df['age'].values, repeats)})

fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(14, 4))
sns.kdeplot(data=df, x='age', weights='total', hue='region', ax=ax1)
sns.boxplot(data=df_total, y='age', x='region', ax=ax2)
plt.tight_layout()
plt.show()

[Figure: boxplot from repeated data]

An alternative would be to do everything outside seaborn, using statsmodels.stats.weightstats.DescrStatsW to calculate the percentiles and plot the boxplots via matplotlib. Outliers would still have to be calculated separately. (See also this post)
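The matplotlib route can be sketched roughly as follows. This is only an illustration under stated assumptions, not a definitive implementation: instead of DescrStatsW it uses a small hand-written numpy helper (weighted_quantile, a hypothetical name) that reads quantiles off the weighted empirical CDF without interpolation, and it assumes the usual 1.5 × IQR rule for whiskers and outliers. The helper names weighted_box_stats and plot_weighted_boxes are likewise made up for this example. Drawing goes through matplotlib's Axes.bxp, which accepts precomputed box statistics.

```python
import numpy as np

def weighted_quantile(values, weights, probs):
    """Weighted quantiles via the inverse weighted empirical CDF (no interpolation)."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)
    values, weights = values[order], weights[order]
    cdf = np.cumsum(weights) / weights.sum()
    # index of the first value whose cumulative weight share reaches each probability
    idx = np.searchsorted(cdf, probs, side="left")
    return values[np.minimum(idx, len(values) - 1)]

def weighted_box_stats(values, weights, label):
    """Build one stats dict in the layout expected by matplotlib's Axes.bxp."""
    values = np.asarray(values, dtype=float)
    q1, med, q3 = weighted_quantile(values, weights, [0.25, 0.5, 0.75])
    iqr = q3 - q1
    # assumption: conventional 1.5 * IQR fences decide what counts as an outlier
    inside = (values >= q1 - 1.5 * iqr) & (values <= q3 + 1.5 * iqr)
    return {"label": label, "q1": q1, "med": med, "q3": q3,
            "whislo": values[inside].min(), "whishi": values[inside].max(),
            "fliers": values[~inside]}

def plot_weighted_boxes(stats_list):
    """Draw the precomputed boxes; matplotlib is imported only when plotting."""
    import matplotlib.pyplot as plt
    fig, ax = plt.subplots()
    ax.bxp(stats_list, showfliers=True)
    ax.set_xlabel("region")
    ax.set_ylabel("age")
    return fig, ax

# same sample data as in the code above: region -> (ages, totals)
data = {
    "STC": ([2.0, 84.0, 55.0, 72.0, 86.0], [11024, 3904, 12944, 5592, 2168]),
    "PHA": ([62.0, 41.0, 13.0, 18.0, 88.0], [4600, 26568, 6928, 8296, 2408]),
}
stats = [weighted_box_stats(ages, totals, region)
         for region, (ages, totals) in data.items()]
```

Calling plot_weighted_boxes(stats) followed by plt.show() would then draw one box per region. Because the quantiles are taken directly from the weights, no rows need to be repeated and no scaling factor has to be chosen.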

JohanC
  • Well, there are only 14 regions in total. I assume that dividing the totals and then repeating the rows will be a good way to go. Thanks a lot for your advice! – Adam Grünwald Nov 22 '21 at 16:05