How to get outlier values for a specific category with boxplot_stats

Question

I'm trying to extract the outliers using a boxplot.

# libraries & dataset
import seaborn as sns
import matplotlib.pyplot as plt
# set a grey background (use sns.set_theme() if seaborn version 0.11.0 or above) 
sns.set(style="darkgrid")
df = sns.load_dataset('iris')

sns.boxplot(y=df["species"], x=df["sepal_length"])
plt.show()

The above plot shows an outlier. I tried to extract the outliers using boxplot_stats, but 'fliers' is an empty array.

from matplotlib.cbook import boxplot_stats  
boxplot_stats(df["sepal_length"])

Output

[{'cihi': 5.966646952167348,
  'cilo': 5.633353047832651,
  'fliers': array([], dtype=float64),
  'iqr': 1.3000000000000007,
  'mean': 5.843333333333334,
  'med': 5.8,
  'q1': 5.1,
  'q3': 6.4,
  'whishi': 7.9,
  'whislo': 4.3}]

Is there a way to extract the outlier shown in the boxplot?

Trenton McKinney · Accepted Answer · 2021-08-27T16:47:59.357


The 'species' needs to be specified.
- boxplot_stats(df["sepal_length"]) computes the statistics for all 'species' combined, which is why no fliers are reported.
- Use .loc and Boolean indexing to select the correct category.
This answer shows how to make the calculation using pandas methods.
The Notes section of matplotlib.pyplot.boxplot shows how outliers are calculated.
Tested in python 3.8.11, pandas 1.3.2, matplotlib 3.4.3, seaborn 0.11.2

from matplotlib.cbook import boxplot_stats
import seaborn as sns

# load the data
df = sns.load_dataset('iris')

boxplot_stats(df.loc[df.species.eq('virginica'), "sepal_length"])

[{'mean': 6.587999999999998,
  'iqr': 0.6750000000000007,
  'cilo': 6.350128717727511,
  'cihi': 6.649871282272489,
  'whishi': 7.9,
  'whislo': 5.6,
  'fliers': array([4.9]),
  'q1': 6.225,
  'med': 6.5,
  'q3': 6.9}]
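
For context, 4.9 is a flier because, with the default whis=1.5, any point below q1 - 1.5 * iqr or above q3 + 1.5 * iqr is treated as an outlier. With the stats above, the lower fence is 6.225 - 1.5 * 0.675 = 5.2125. The following is only a minimal, dependency-free sketch of that rule, not the matplotlib implementation: the helper name iqr_fliers and the sample values are made up for illustration, and it assumes that statistics.quantiles with method="inclusive" interpolates the same way as the default np.percentile.

```python
from statistics import quantiles


def iqr_fliers(values, whis=1.5):
    """Return the values lying outside [q1 - whis*iqr, q3 + whis*iqr]."""
    # method="inclusive" interpolates linearly between data points
    # (assumed here to correspond to the default np.percentile behaviour)
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - whis * iqr, q3 + whis * iqr
    return [v for v in values if v < lower or v > upper]


# hypothetical sample, not taken from the iris dataset
sample = [1, 10, 11, 12, 13, 14, 15, 16, 17, 30]
print(iqr_fliers(sample))  # [1, 30]
```

Here q1 is 11.25 and q3 is 15.75, so the fences are 4.5 and 22.5, and only 1 and 30 fall outside them.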

Get all outliers

for species, data in df.groupby('species'):
    data = data.iloc[:, :-1]  # drop off the species column
    print(f'Outliers for: {species}')
    stats = boxplot_stats(data)
    for col, stat in zip(data.columns, stats):
        print(f"{col}: {stat['fliers'].tolist()}")
    print('\n')

[out]:
Outliers for: setosa
sepal_length: []
sepal_width: [2.3, 4.4]
petal_length: [1.1, 1.0, 1.9, 1.9]
petal_width: [0.5, 0.6]


Outliers for: versicolor
sepal_length: []
sepal_width: []
petal_length: [3.0]
petal_width: []


Outliers for: virginica
sepal_length: [4.9]
sepal_width: [2.2, 3.8, 3.8]
petal_length: []
petal_width: []

`seaborn.catplot`

sns.catplot(kind='box', data=df.melt(id_vars='species'), x='value', y='variable', hue='species', aspect=1.5)

edited Aug 27 '21 at 16:47

answered Aug 26 '21 at 17:37

Trenton McKinney


Does filtering on the fliers return all the outliers? Like `df[df["sepal_length"] == 4.9]` – Ailurophile Aug 27 '21 at 04:17
What you wrote returns all rows where sepal_length equals 4.9. My answer returns the stats only for sepal_length for the specific species, which includes all fliers. I’m going to bed now and won’t respond until after 9am PST (UTC-7) – Trenton McKinney Aug 27 '21 at 05:04
How do I return the rows of the outliers? Please respond when you are available. – Ailurophile Aug 27 '21 at 05:15
@Pluviophile this question [how to use pandas filter with IQR](https://stackoverflow.com/q/34782063/7758804) shows how to filter the dataframe. In the example dataframe, there are 3 species, each of which has 4 metrics. The calculation needs to be done for each group, because each group has different statistics. – Trenton McKinney Aug 27 '21 at 11:50
I should filter each group and apply IQR and filter the data right? – Ailurophile Aug 27 '21 at 16:23
@Pluviophile As far as I can tell that is what needs to happen. If you have additional issues applying those methods to your dataframe, you should open a new question, and be sure to reference the other question. – Trenton McKinney Aug 27 '21 at 16:26
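
As a rough sketch of the per-group filtering discussed in these comments (compute the IQR fences within each group, then keep the rows that fall outside them), something like the following may work. The helper name iqr_outlier_rows and the toy dataframe are hypothetical, and the 1.5 multiplier is assumed to match the default whisker setting. This is not the code from the linked question.

```python
import pandas as pd


def iqr_outlier_rows(df, group_col, value_col, whis=1.5):
    """Return rows whose value_col lies outside the IQR fences of its own group."""
    grouped = df.groupby(group_col)[value_col]
    # broadcast each group's quartiles back onto the original rows
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    iqr = q3 - q1
    mask = (df[value_col] < q1 - whis * iqr) | (df[value_col] > q3 + whis * iqr)
    return df[mask]


# hypothetical toy data, not the iris dataset
toy = pd.DataFrame({
    "species": ["a"] * 6 + ["b"] * 6,
    "sepal_length": [5.0, 5.1, 5.2, 5.3, 5.4, 9.0,
                     6.0, 6.1, 6.2, 6.3, 6.4, 2.0],
})

outlier_rows = iqr_outlier_rows(toy, "species", "sepal_length")
print(outlier_rows)
```

Because the fences are computed per group, 9.0 is flagged only relative to group "a" and 2.0 only relative to group "b". The full rows are returned, so the other columns stay available.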
