Grouping boxplots in seaborn when input is a DataFrame

Question

I want to plot multiple columns of a pandas DataFrame, all grouped by another column, using the groupby option of seaborn.boxplot. There is a nice answer to a similar problem in matplotlib ("matplotlib: Group boxplots"), but since seaborn.boxplot comes with a groupby option I thought it would be much easier to do this in seaborn.

Here we go with a reproducible example that fails:

import seaborn as sns
import pandas as pd
df = pd.DataFrame([[2, 4, 5, 6, 1], [4, 5, 6, 7, 2], [5, 4, 5, 5, 1],
                   [10, 4, 7, 8, 2], [9, 3, 4, 6, 2], [3, 3, 4, 4, 1]],
                  columns=['a1', 'a2', 'a3', 'a4', 'b'])

# display(df)
   a1  a2  a3  a4  b
0   2   4   5   6  1
1   4   5   6   7  2
2   5   4   5   5  1
3  10   4   7   8  2
4   9   3   4   6  2
5   3   3   4   4  1

#Plotting by seaborn
sns.boxplot(df[['a1','a2', 'a3', 'a4']], groupby=df.b)

What I get completely ignores the groupby option:

[image: Failed groupby]

Whereas with a single column it works, thanks to another SO question (Seaborn groupby pandas Series):

sns.boxplot(df.a1, groupby=df.b)

[image: seaborn that does not fail]

So I would like to get all my columns in one plot (all columns come in a similar scale).

EDIT:

The above SO question was edited and now includes a 'not clean' answer to this problem, but it would be nice if someone has a better idea for this problem.

mwaskom · Answer 1 · 2014-08-13T14:47:35.450

28

As the other answers note, the boxplot function is limited to plotting a single "layer" of boxplots, and the groupby parameter only has an effect when the input is a Series and you have a second variable you want to use to bin the observations into each box.

However, you can accomplish what I think you're hoping for with the factorplot function, using kind="box". But, you'll first have to "melt" the sample dataframe into what is called long-form or "tidy" format where each column is a variable and each row is an observation:

df_long = pd.melt(df, "b", var_name="a", value_name="c")

Then it's very simple to plot:

sns.factorplot("a", hue="b", y="c", data=df_long, kind="box")

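For readers on a newer seaborn: factorplot was renamed catplot in seaborn 0.9, so the call above may not exist in a current install. A minimal sketch of the same plot, assuming seaborn 0.9 or later and the df from the question (the keyword-only call style is my choice, not part of the original answer):

```python
import pandas as pd
import seaborn as sns

# The example frame from the question.
df = pd.DataFrame([[2, 4, 5, 6, 1], [4, 5, 6, 7, 2], [5, 4, 5, 5, 1],
                   [10, 4, 7, 8, 2], [9, 3, 4, 6, 2], [3, 3, 4, 4, 1]],
                  columns=['a1', 'a2', 'a3', 'a4', 'b'])

# Wide -> long: one row per (b, original column name, value).
df_long = pd.melt(df, id_vars="b", var_name="a", value_name="c")

# catplot is the newer name for factorplot; it returns a FacetGrid.
g = sns.catplot(data=df_long, x="a", y="c", hue="b", kind="box")
```

Because catplot is figure-level, it creates its own figure; use sns.boxplot instead if you need to draw into an existing axes.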

edited Aug 13 '14 at 14:47

answered Aug 13 '14 at 14:39


12

This gets occasional upvotes, but FWIW nested boxplots have been possible in `sns.boxplot` since 0.6. – mwaskom Jan 22 '16 at 20:25
1

this `melt` is insane and super unexpected – seralouk May 05 '20 at 11:40

score 11 · Accepted Answer · edited Jun 16 '22 at 17:46

You can use boxplot directly (I imagine this was not possible when the question was asked, but it is with seaborn 0.6 or later).

As explained by @mwaskom, you have to "melt" the sample dataframe into its "long-form" where each column is a variable and each row is an observation:

df_long = pd.melt(df, "b", var_name="a", value_name="c")

# display(df_long.head())
   b   a   c
0  1  a1   2
1  2  a1   4
2  1  a1   5
3  2  a1  10
4  2  a1   9

Then you just plot it:

sns.boxplot(x="a", hue="b", y="c", data=df_long)
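Put together as one script, the two steps above might look like the sketch below. It assumes seaborn 0.6 or later and the df from the question; the explicit axes and the axis label text are my additions for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# The example frame from the question.
df = pd.DataFrame([[2, 4, 5, 6, 1], [4, 5, 6, 7, 2], [5, 4, 5, 5, 1],
                   [10, 4, 7, 8, 2], [9, 3, 4, 6, 2], [3, 3, 4, 4, 1]],
                  columns=['a1', 'a2', 'a3', 'a4', 'b'])

# 6 rows x 4 value columns -> 24 rows in long form.
df_long = pd.melt(df, id_vars="b", var_name="a", value_name="c")

# One group of boxes per original column, one box per value of b.
fig, ax = plt.subplots()
sns.boxplot(data=df_long, x="a", y="c", hue="b", ax=ax)
ax.set_xlabel("column")  # illustrative label text
ax.set_ylabel("value")   # illustrative label text
```

Call plt.show() or fig.savefig(...) afterwards to view or save the figure.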

jrjc · Answer 3 · 2014-08-13T14:35:35.417

8

The groupby parameter of seaborn's boxplot takes a Series, not a DataFrame; that's why it's not working.

As a workaround, you can do this:

import matplotlib.pyplot as plt
# one subplot per value of b, sharing the y axis
fig, ax = plt.subplots(1, 2, sharey=True)
for i, (key, grp) in enumerate(df.filter(regex="a").groupby(by=df.b)):
    sns.boxplot(grp, ax=ax[i])

This draws one subplot per value of b.

Note that df.filter(regex="a") is equivalent to df[['a1','a2', 'a3', 'a4']]

   a1  a2  a3  a4
0   2   4   5   6
1   4   5   6   7
2   5   4   5   5
3  10   4   7   8
4   9   3   4   6
5   3   3   4   4
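The equivalence is easy to check. The sketch below also shows a stricter pattern, since regex="a" matches any column label that contains an "a" anywhere; the anchored pattern is my suggestion, not part of the original answer.

```python
import pandas as pd

# The example frame from the question.
df = pd.DataFrame([[2, 4, 5, 6, 1], [4, 5, 6, 7, 2], [5, 4, 5, 5, 1],
                   [10, 4, 7, 8, 2], [9, 3, 4, 6, 2], [3, 3, 4, 4, 1]],
                  columns=['a1', 'a2', 'a3', 'a4', 'b'])

by_regex = df.filter(regex="a")            # labels containing "a"
by_name = df[['a1', 'a2', 'a3', 'a4']]     # explicit column list
same = by_regex.equals(by_name)

# Stricter: only labels of the form "a" followed by digits.
strict = df.filter(regex=r"^a\d+$")
```

On this frame both selections agree; on a frame with a column such as "label" the unanchored pattern would pick that up too.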

Hope this helps

edited Aug 13 '14 at 14:35

answered Aug 13 '14 at 14:16


Thanks I accepted the answer below because it gives all the plot in a single figure. – Arman Aug 13 '14 at 14:54

score 5 · Answer 4 · answered Aug 13 '14 at 11:55

It isn't really any better than the answer you linked, but I think the way to achieve this in seaborn is using the FacetGrid feature, as the groupby parameter is only defined for Series passed to the boxplot function.

Here's some code - the pd.melt is necessary because (as best I can tell) the facet mapping can only take individual columns as parameters, so the data need to be turned into a 'long' format.

g = sns.FacetGrid(pd.melt(df, id_vars='b'), col='b')
g.map(sns.boxplot, 'value', 'variable')

[image: faceted seaborn boxplot]

It's actually not necessary to use `FacetGrid` directly if you want this kind of plot; you can use `factorplot` here too with `col="b"`. (This isn't wrong, it's just more work than necessary.) – mwaskom Aug 13 '14 at 15:48
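A sketch of what that comment suggests, written with catplot (the name factorplot has carried since seaborn 0.9). It assumes the df from the question and keeps the same x/y mapping as the FacetGrid code above:

```python
import pandas as pd
import seaborn as sns

# The example frame from the question.
df = pd.DataFrame([[2, 4, 5, 6, 1], [4, 5, 6, 7, 2], [5, 4, 5, 5, 1],
                   [10, 4, 7, 8, 2], [9, 3, 4, 6, 2], [3, 3, 4, 4, 1]],
                  columns=['a1', 'a2', 'a3', 'a4', 'b'])

# With no var_name/value_name, melt uses the default column
# names "variable" and "value".
df_long = pd.melt(df, id_vars="b")

# One facet (column of the grid) per value of b.
g = sns.catplot(data=df_long, x="value", y="variable", col="b", kind="box")
```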

score 1 · Answer 5 · edited Sep 22 '21 at 16:46

It's not adding a lot to this conversation, but after struggling with this for longer than warranted (the actual clusters are unusable), I thought I would add my implementation as another example. It's got a superimposed scatterplot (because of how annoying my dataset is), shows melt using indices, and some aesthetic tweaks. I hope this is useful for someone.

[image: output_graph]

Here it is without using column headers (I saw a different thread that wanted to know how to do this using indices):

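import numpy as np
import pandas as pd
from numpy import ndarray
from pandas import DataFrame
from typing import Dict, List

combined_array: ndarray = np.concatenate([dbscan_output.data, dbscan_output.labels.reshape(-1, 1)], axis=1)
cluster_data_df: DataFrame = DataFrame(combined_array)
# if you want to use labelled columns:
column_names: List[str] = list(outcome_variable_names)
column_names.append('cluster')
# set_axis(..., inplace=True) was removed in pandas 2.0, so reassign instead
cluster_data_df = cluster_data_df.set_axis(column_names, axis='columns')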

graph_data: DataFrame = pd.melt(
    frame=cluster_data_df,
    id_vars=['cluster'],
    # value_vars is an optional param - by default it uses columns except the id vars, but I've included it as an example
    # value_vars=['outcome_var_1', 'outcome_var_2', 'outcome_var_3', 'outcome_var_4', 'outcome_var_5', 'outcome_var_6'] 
    var_name='psychometric_test',
    value_name='standard deviations from the mean'
)

The resulting dataframe (rows = sample_n x variable_n (in my case 1626 x 6 = 9756)):

index	psychometric_test	standard deviations from the mean
0	outcome_var_1	-1.276182
1	outcome_var_1	-1.118813
2	outcome_var_1	-1.276182
...	...	...
9754	outcome_var_6	0.892548
9755	outcome_var_6	1.420480

If you want to use indices with melt:

graph_data: DataFrame = pd.melt(
    frame=cluster_data_df,
    id_vars=cluster_data_df.columns[-1],
    # value_vars=cluster_data_df.columns[:-1],
    var_name='psychometric_test',
    value_name='standard deviations from the mean'
)
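To make the positional version concrete, here is a small self-contained sketch. The array values and labels are made up as a stand-in for the clustering output, and the names data, labels, frame and long_df are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the clustering output:
# 4 samples x 3 outcome variables, plus one cluster label per sample.
data = np.array([[0.1, 0.2, 0.3],
                 [1.1, 1.2, 1.3],
                 [2.1, 2.2, 2.3],
                 [3.1, 3.2, 3.3]])
labels = np.array([0, 1, 0, -1])

# reshape(-1, 1) turns the labels into a single column so the shapes line up.
combined = np.concatenate([data, labels.reshape(-1, 1)], axis=1)
frame = pd.DataFrame(combined)  # integer column labels 0..3

# Use the last column as the id variable, selected by position.
last = frame.columns[-1]
long_df = pd.melt(frame, id_vars=[last], var_name="variable", value_name="value")
long_df = long_df.rename(columns={last: "cluster"})
```

The result has one row per sample and variable (4 x 3 = 12 here), with the original integer column labels in the "variable" column.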

And here's the graphing code: (Done with column headings - just note that y-axis=value_name, x-axis = var_name, hue = id_vars):

# plot graph grouped by cluster
# style settings are seaborn functions, not Figure methods
sns.set_theme(style="ticks", font_scale=1.2)
sns.set_style("white")
fig = plt.figure(figsize=(10, 10))

# create boxplot
fig.ax = sns.boxplot(y='standard deviations from the mean', x='psychometric_test', hue='cluster', showfliers=False,
                     data=graph_data)

# set box alpha:
for patch in fig.ax.artists:
    r, g, b, a = patch.get_facecolor()
    patch.set_facecolor((r, g, b, .2))

# create scatterplot
fig.ax = sns.stripplot(y='standard deviations from the mean', x='psychometric_test', hue='cluster', data=graph_data,
                       dodge=True, alpha=.25, zorder=1)

# customise legend:
cluster_n: int = dbscan_output.n_clusters
## create list with legend text
i = 0
cluster_info: Dict[int, int] = dbscan_output.cluster_sizes  # custom method
legend_labels: List[str] = []
while i < cluster_n:
    label: str = f"cluster {i+1}, n = {cluster_info[i]}"
    legend_labels.append(label)
    i += 1
if -1 in cluster_info.keys():
    cluster_n += 1
    label: str = f"Unclustered, n = {cluster_info[-1]}"
    legend_labels.insert(0, label)

## fetch existing handles and legends (each tuple will have 2*cluster number -> 1 for each boxplot cluster, 1 for each scatterplot cluster, so I will remove the first half)
handles, labels = fig.ax.get_legend_handles_labels()
index: int = int(cluster_n*(-1))
labels = legend_labels
plt.legend(handles[index:], labels[0:])
plt.xticks(rotation=45)
plt.show()
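The legend-text loop above can also be written as a small helper. This is only a sketch of the same logic: the function name build_legend_labels is hypothetical, and it assumes the cluster sizes come as a dict keyed by cluster id, with -1 marking unclustered points as in the code above.

```python
from typing import Dict, List


def build_legend_labels(cluster_sizes: Dict[int, int]) -> List[str]:
    """Build legend text from a {cluster_id: size} mapping.

    Cluster ids are assumed to be 0-based; -1 (if present) marks
    unclustered points and is listed first.
    """
    labels: List[str] = []
    if -1 in cluster_sizes:
        labels.append(f"Unclustered, n = {cluster_sizes[-1]}")
    for cluster_id in sorted(k for k in cluster_sizes if k >= 0):
        labels.append(f"cluster {cluster_id + 1}, n = {cluster_sizes[cluster_id]}")
    return labels


# Example with made-up sizes.
example = build_legend_labels({-1: 5, 0: 40, 1: 12})
```

The returned list can be passed to plt.legend in place of legend_labels.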


Just a note: Most of my time was spent debugging the melt function. I predominantly got the error "*only integer scalar arrays can be converted to a scalar index with 1D numpy indices array*". My output required me to concatenate my outcome variable value table and the clusters (DBSCAN), and I'd put extra square brackets around the cluster array in the concat method. So I had a column where each value was an invisible List[int], rather than a plain int. It's pretty niche, but maybe it'll help someone.
