22

I have a pandas DataFrame named titanic. I am plotting a bar chart as described in the official Seaborn documentation, using the following code:

import seaborn as sns

titanic = sns.load_dataset("titanic")
sns.catplot(x="sex", y="survived", hue="class", kind="bar", data=titanic)

This produces the following plot:

barplot by categories

As you can see, the hue is represented by the class. How can I manually choose the hue order so that I can reverse the current one?

cottontail
Marioanzas

2 Answers

46

To manually select the hue order of a Seaborn plot, define the desired order as a list and pass it to the plotting function as the hue_order argument. The following code would work:

import seaborn as sns

titanic = sns.load_dataset("titanic")
hue_order = ['Third', 'Second', 'First']
sns.catplot(x="sex", y="survived", hue="class", data=titanic, hue_order=hue_order, kind="bar")
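
If the goal is simply to reverse whatever order the column currently has, the list can also be derived instead of typed out. This is a minimal sketch, assuming the hue column is of category dtype (as class is in seaborn's titanic dataset); the small DataFrame below is a made-up stand-in so the snippet runs without downloading anything.

```python
import pandas as pd

# Made-up stand-in for the titanic data; only the "class" column matters here.
df = pd.DataFrame({
    "class": pd.Categorical(
        ["Third", "First", "Second", "Third"],
        categories=["First", "Second", "Third"],
    ),
    "survived": [0, 1, 1, 0],
})

# Reverse the existing category order instead of hard-coding it.
hue_order = list(df["class"].cat.categories[::-1])
print(hue_order)  # ['Third', 'Second', 'First']
```

The resulting list can then be passed as hue_order=hue_order, exactly as in the catplot call above.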
Marioanzas
0

hue_order is especially useful when the order has to be set dynamically. For example, the following code orders the hue levels by ascending survival rate per class.

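import seaborn as sns

df = sns.load_dataset('titanic')
# survival rate by class in ascending order
# (observed=True only silences the pandas warning for categorical groupers; all three classes occur)
hue_order = df.groupby('class', observed=True)['survived'].mean().sort_values().index
sns.catplot(x="sex", y="survived", hue="class", data=df, hue_order=hue_order, kind="bar");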

As the source code shows, if the data used for hue defines a categories attribute (e.g. pandas category dtype), hue_order is taken from those categories, in whatever order they are stored. Otherwise, for non-numeric data (such as embarked in the titanic dataset), it is the order in which the unique values first appear in the data, and for numeric data it is the sorted unique values.¹
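
The rule described above can be sketched with pandas alone. Note that sketch_default_order is a hypothetical helper written for illustration; it only mimics the described behaviour and is not seaborn's actual implementation.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype


def sketch_default_order(values):
    """Hypothetical helper mimicking the default hue order rule (not seaborn's code)."""
    s = pd.Series(values)
    if isinstance(s.dtype, pd.CategoricalDtype):
        # category dtype: use the categories exactly as they are stored
        return list(s.cat.categories)
    # otherwise: non-NaN unique values in order of first appearance
    order = s.dropna().unique()
    if is_numeric_dtype(s):
        # numeric data: sort the unique values
        order = np.sort(order)
    return order.tolist()


# category dtype: the stored categories win, regardless of the order in the data
cat_order = sketch_default_order(
    pd.Categorical(["Third", "First"], categories=["First", "Second", "Third"])
)
# object dtype: order of first appearance, missing values dropped
obj_order = sketch_default_order(["S", "C", None, "S", "Q"])
# numeric dtype: sorted unique values
num_order = sketch_default_order([3, 1, 2, 3])
```

Here cat_order follows the categories, obj_order follows first appearance, and num_order is sorted, which matches the three cases checked against real plots in the footnote.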

So another way to order hue is to change the order of the categories in the source dataframe.²

import pandas as pd

# convert to a categorical variable with the categories in the "correct" order
df['class'] = pd.Categorical(df['class'], categories=['Third', 'Second', 'First'])
sns.catplot(x="sex", y="survived", hue="class", data=df, kind="bar");

result
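
When the column is already of category dtype, the same reordering can probably be expressed with Series.cat.reorder_categories instead of rebuilding the column. A minimal sketch, again on a made-up stand-in DataFrame rather than the real dataset:

```python
import pandas as pd

# Made-up stand-in; "class" starts with the First/Second/Third category order.
df = pd.DataFrame({
    "class": pd.Categorical(
        ["Third", "First", "Second"],
        categories=["First", "Second", "Third"],
    ),
})

# Reorder the categories; the values in the column are left untouched.
df["class"] = df["class"].cat.reorder_categories(["Third", "Second", "First"])
new_categories = list(df["class"].cat.categories)
```

reorder_categories raises an error if the new list does not contain exactly the existing categories, which guards against typos that pd.Categorical would silently turn into NaN.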


¹ Using the titanic dataset built into seaborn, we can see these differences. The order is determined differently for each dtype.

import numpy as np  # needed for np.sort in the numeric case below

# category dtype
df['class'].dtype        # CategoricalDtype(categories=['First', 'Second', 'Third'], ordered=False)
g = sns.catplot(x='sex', y='survived', hue='class', data=df, kind='bar')
(df['class'].cat.categories == g.axes[0][0].get_legend_handles_labels()[1]).all()        # True
#           ^^^^^^^^^^^^^^^  <---- categories determine the order

# for object dtype
df['embarked'].dtype     # dtype('O')
g = sns.catplot(x='sex', y='survived', hue='embarked', data=df, kind='bar')
(df['embarked'].dropna().unique() == g.axes[0][0].get_legend_handles_labels()[1]).all()  # True
#               ^^^^^^^^^^^^^^^  <---- non-NaN unique values (in the order they first appear in the data) determine the order

# numeric dtype
df['pclass'].dtype       # dtype('int64')
g = sns.catplot(x='sex', y='survived', hue='pclass', data=df, kind='bar')
(np.sort(df['pclass'].dropna().unique()).astype(str) == g.axes[0][0].get_legend_handles_labels()[1]).all()  # True
#  ^^^^^              ^^^^^^^^^^^^^^^  <---- sorted non-NaN unique values determine the order

² As a side note, you will probably want to convert all columns that contain categorical data to category dtype anyway. Among other desirable properties, it is much more memory efficient than the alternatives. For example, the sex column uses roughly one-sixth of the memory when its dtype is category instead of object.

df['sex'].memory_usage()                     # 7256
df['sex'].astype('category').memory_usage()  # 1143

df['pclass'].memory_usage()                                                   # 7256
df['pclass'].astype(pd.CategoricalDtype(categories=[3, 2, 1])).memory_usage() # 1151
#                   ^^^^^^^^^^^^^^^^^^^ categories in a specific order
cottontail