I have an issue plotting a categorical grouped boxplot with seaborn in Python, specifically when using 'hue'.

My raw data is shown in the figure below. I want to plot the values in column 8, grouped by columns 1 and 4.

Snapshot of my raw data

I used seaborn and my code is shown below:

ax = sns.boxplot(x=output[:,1], y=output[:,8], hue=output[:,4])
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
plt.legend([],[])

However, the generated plot always contains large blank areas, as shown in the upper figure below. Following a post here (https://stackoverflow.com/questions/53641287/off-center-x-axis-in-seaborn), I tried adding 'dodge=False' to sns.boxplot, but that gives the lower figure below.

Categorical boxplot generated by seaborn with 'dodge=True' (upper) and 'dodge=False' (lower), respectively.

What I actually want Python to produce is a boxplot like the one I generated using JMP, shown below.

Boxplot generated by JMP using the same raw data

It seems that if one of the second-level categories is empty, seaborn still reserves space for it within each first-level category, which causes the observed offset/blank area.

So I wonder whether there is any way to solve this issue, for example by using another package in Python?

JohanC
Enlong Liu

1 Answer

Seaborn reserves a spot for each individual hue value at every x-value, even when some of these values are missing there. When many hue values are missing, this leads to annoying open spots. (If there were only one box per x-value, dodge=False would solve the problem.)
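To get a feeling for how many of those reserved spots stay empty, one can cross-tabulate the x column against the hue column. The following is only a minimal sketch on a made-up toy dataframe (the column names 'label', 'cat' and 'value' and the data are assumptions for illustration, not the asker's data):

```python
import pandas as pd

# Hypothetical toy data: 3 x-labels, 4 hue categories,
# each label only uses some of the categories
df = pd.DataFrame({
    'label': ['label1', 'label1', 'label2', 'label2', 'label3', 'label3'],
    'cat':   ['a', 'b', 'c', 'c', 'a', 'd'],
    'value': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# One row per x-label, one column per hue value; a zero marks a
# label/hue combination without any data
counts = pd.crosstab(df['label'], df['cat'])
n_slots = counts.size                      # one spot per (label, hue) pair
n_empty = int((counts == 0).sum().sum())   # spots without data

print(counts)
print(f'{n_empty} of {n_slots} label/hue combinations have no data')
```

If most combinations turn out to be empty, a grouped boxplot with hue will consist mainly of open spots, and splitting the plot per x-label is probably the better fit.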

A workaround is to generate a separate subplot for each individual x-label.

Reproducible example for default boxplot with missing hue values

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

np.random.seed(20230206)
df = pd.DataFrame({'label': np.repeat(['label1', 'label2', 'label3', 'label4'], 250),
                   'cat': np.repeat(np.random.choice([*'abcdefghijklmnopqrst'], 40), 25),
                   'value': np.random.randn(1000).cumsum()})
df['cat'] = pd.Categorical(df['cat'], [*'abcdefghijklmnopqrst'])
sns.set_style('white')
plt.figure(figsize=(15, 5))
ax = sns.boxplot(data=df, x='label', y='value', hue='cat', palette='turbo')
sns.move_legend(ax, loc='upper left', bbox_to_anchor=(1, 1), ncol=2)
sns.despine()
plt.tight_layout()
plt.show()

default sns.boxplot with missing hue values

Individual subplots per x value

  • A FacetGrid is generated with a subplot ("facet") for each x value
  • The original hue will be used as the x-value for each subplot. To avoid empty spots, the hue column should be of string type. If the hue column were a pd.Categorical, seaborn would still reserve a spot for each of the categories.
df['cat'] = df['cat'].astype(str)  # the column should be of string type, not pd.Categorical
g = sns.FacetGrid(df, col='label', sharex=False)
g.map_dataframe(sns.boxplot, x='cat', y='value')
for label, ax in g.axes_dict.items():
    ax.set_title('')  # remove the title generated by sns.FacetGrid
    ax.set_xlabel(label)  # use the label from the dataframe as xlabel
plt.tight_layout()
plt.show()

individual subplots for sns.boxplot per x value
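The difference between a categorical and a string column can be checked with pandas alone. This is a small sketch on hypothetical data; it only illustrates that a subset of a pd.Categorical column still carries the full category list, which is presumably what makes seaborn reserve a spot for every category:

```python
import pandas as pd

# Hypothetical toy data: 5 declared categories, only a few of them used
all_cats = list('abcde')
df = pd.DataFrame({
    'label': ['label1', 'label1', 'label2'],
    'cat': pd.Categorical(['a', 'b', 'e'], categories=all_cats),
    'value': [0.1, 0.2, 0.3],
})

# The rows that would end up in the subplot for 'label1'
subset = df[df['label'] == 'label1']

# As pd.Categorical, the subset still knows about all 5 categories
n_categorical = len(subset['cat'].cat.categories)

# As plain strings, only the values that are really present remain
n_string = subset['cat'].astype(str).nunique()

print(n_categorical, n_string)
```

So converting with astype(str) before building the FacetGrid means each subplot only sees the categories that occur in its own rows.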

Adding consistent coloring

A dictionary palette can color the boxes such that corresponding boxes in different subplots have the same color. hue= with the same column as the x= will do the coloring, and dodge=False will remove the empty spots.

df['cat'] = df['cat'].astype(str)  # the column should be of string type, not pd.Categorical
cats = np.sort(df['cat'].unique())
palette_dict = {cat: color for cat, color in zip(cats, sns.color_palette('turbo', len(cats)))}
g = sns.FacetGrid(df, col='label', sharex=False)
g.map_dataframe(sns.boxplot, x='cat', y='value',
                hue='cat', dodge=False, palette=palette_dict)
for label, ax in g.axes_dict.items():
    ax.set_title('')  # remove the title generated by sns.FacetGrid
    ax.set_xlabel(label)  # use the label from the dataframe as xlabel
    # ax.tick_params(axis='x', labelrotation=90) # optionally rotate the tick labels
plt.tight_layout()
plt.show()

individual subplots with coloring

JohanC
  • Thanks for your reply. But I am still confused about the second part, since the four generated figures all have the same second-level x labels. Judging from the first and third parts, the data present under each label differ, so shouldn't the figures in the second part of your reply have different second-level x labels? – Enlong Liu Feb 06 '23 at 09:12
  • Many thanks for your feedback. The default `sharex=True` caused all ticks to be equal. And setting `order=cats` also obliges all categories to be present everywhere. Both problems have been updated now. (To have a consistent order of the ticks, the original dataframe could be sorted on the 'cat' column.) – JohanC Feb 06 '23 at 09:49
  • Thanks a lot! That is exactly what I want. I will try to adjust my code according to your reply. – Enlong Liu Feb 06 '23 at 10:03
  • It has been answered very well. But in my actual case, I still get an error when running the `sns.FacetGrid` code, saying 'arrays used as indices must be of integer (or boolean) type'. I don't think it relates to the issue in this question, though, so I will try to figure it out. – Enlong Liu Feb 06 '23 at 10:26
  • If your column names are really numbers, as in your example code, instead of `sns.boxplot(x=output[:,1], y=output[:,8], hue=output[:,4])`, you should write `sns.boxplot(data=output, x=1, y=8, hue=4)`. And use `FacetGrid` in a similar way. Better yet, you could give recognizable names to the columns. E.g. `output.columns = ['ID', 'when', 'time', ....]` and use those names in Seaborn. – JohanC Feb 06 '23 at 10:32
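A possible way to follow the advice in the last comment, assuming `output` is a 2D NumPy array (which the `output[:,1]` indexing suggests, but the question does not state): wrap it in a dataframe and give the relevant columns recognizable names. The array contents and the names 'label', 'cat' and 'value' below are hypothetical placeholders, so this is a sketch rather than a drop-in fix:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `output`: a 2D object array with 9 columns,
# where column 1 and 4 hold categories and column 8 holds the values
output = np.array([
    ['r0', 'label1', 'x', 'y', 'a', 0, 0, 0, 1.5],
    ['r1', 'label1', 'x', 'y', 'b', 0, 0, 0, 2.5],
    ['r2', 'label2', 'x', 'y', 'c', 0, 0, 0, 0.5],
], dtype=object)

df = pd.DataFrame(output)  # the columns are named 0..8 at this point
df = df.rename(columns={1: 'label', 4: 'cat', 8: 'value'})  # hypothetical names

# An object array yields object columns; convert them explicitly
df['value'] = pd.to_numeric(df['value'])
df['cat'] = df['cat'].astype(str)

print(df[['label', 'cat', 'value']])

# The dataframe can then be used as in the answer, e.g.:
# g = sns.FacetGrid(df, col='label', sharex=False)
# g.map_dataframe(sns.boxplot, x='cat', y='value')
```

With named columns, seaborn receives a dataframe plus column names instead of raw array slices, which tends to avoid indexing errors of the kind mentioned above.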