
I'm trying to extract the outliers using a boxplot.

# libraries & dataset
import seaborn as sns
import matplotlib.pyplot as plt
# set a grey background (use sns.set_theme() if seaborn version 0.11.0 or above) 
sns.set(style="darkgrid")
df = sns.load_dataset('iris')

sns.boxplot(y=df["species"], x=df["sepal_length"])
plt.show()

[Image: box plot of sepal_length for each species]

The plot above shows an outlier. I tried to extract it using boxplot_stats, but 'fliers' is an empty array.

from matplotlib.cbook import boxplot_stats  
boxplot_stats(df["sepal_length"])

Output

[{'cihi': 5.966646952167348,
  'cilo': 5.633353047832651,
  'fliers': array([], dtype=float64),
  'iqr': 1.3000000000000007,
  'mean': 5.843333333333334,
  'med': 5.8,
  'q1': 5.1,
  'q3': 6.4,
  'whishi': 7.9,
  'whislo': 4.3}]

Is there a way to extract the outlier shown in the boxplot?

Trenton McKinney
Ailurophile

1 Answer

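  • The 'species' needs to be specified, because the box plot draws a separate box for each species.
    • boxplot_stats(df["sepal_length"]) returns the statistics for all 'species' combined, which is why 'fliers' is empty.
    • Use .loc and Boolean indexing to select the correct category.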
  • This answer shows how to make the calculation using pandas methods.
  • The Notes section of matplotlib.pyplot.boxplot shows how outliers are calculated.
  • Tested in python 3.8.11, pandas 1.3.2, matplotlib 3.4.3, seaborn 0.11.2
from matplotlib.cbook import boxplot_stats
import seaborn as sns

# load the data
df = sns.load_dataset('iris')

boxplot_stats(df.loc[df.species.eq('virginica'), "sepal_length"])

[{'mean': 6.587999999999998,
  'iqr': 0.6750000000000007,
  'cilo': 6.350128717727511,
  'cihi': 6.649871282272489,
  'whishi': 7.9,
  'whislo': 5.6,
  'fliers': array([4.9]),
  'q1': 6.225,
  'med': 6.5,
  'q3': 6.9}]
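
The 'fliers' entry can also be reproduced by hand. This is only a minimal sketch, assuming the default whis=1.5 rule and linearly interpolated quartiles; the function name iqr_fliers and the sample values are made up for illustration and are not taken from the iris data.

```python
import pandas as pd


def iqr_fliers(values: pd.Series, whis: float = 1.5) -> pd.Series:
    """Return the values lying more than whis * IQR beyond the quartiles."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - whis * iqr, q3 + whis * iqr
    return values[(values < lower) | (values > upper)]


# hypothetical sample with one low value
sample = pd.Series([4.9, 6.2, 6.3, 6.4, 6.5, 6.7, 6.9, 7.0])
print(iqr_fliers(sample).tolist())  # [4.9]
```

With the real data, the same function could be called on df.loc[df.species.eq('virginica'), "sepal_length"].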

Get all outliers

for species, data in df.groupby('species'):
    data = data.iloc[:, :-1]  # drop off the species column
    print(f'Outliers for: {species}')
    stats = boxplot_stats(data)
    for col, stat in zip(data.columns, stats):
        print(f"{col}: {stat['fliers'].tolist()}")
    print('\n')

[out]:
Outliers for: setosa
sepal_length: []
sepal_width: [2.3, 4.4]
petal_length: [1.1, 1.0, 1.9, 1.9]
petal_width: [0.5, 0.6]


Outliers for: versicolor
sepal_length: []
sepal_width: []
petal_length: [3.0]
petal_width: []


Outliers for: virginica
sepal_length: [4.9]
sepal_width: [2.2, 3.8, 3.8]
petal_length: []
petal_width: []
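
If the full rows are needed rather than only the flier values, one possible approach is to compute the IQR bounds per group and filter with a Boolean mask. The following is a sketch under the same whis=1.5 assumption; outlier_rows, the toy dataframe and its group labels 'a' and 'b' are hypothetical and only stand in for the iris columns.

```python
import pandas as pd


def outlier_rows(df, group_col, value_col, whis=1.5):
    """Return the rows whose value_col is an IQR outlier within its own group."""
    grouped = df.groupby(group_col)[value_col]
    # transform broadcasts each group's quartile back to that group's rows
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    iqr = q3 - q1
    mask = (df[value_col] < q1 - whis * iqr) | (df[value_col] > q3 + whis * iqr)
    return df[mask]


# hypothetical data: group 'a' has one low value, group 'b' has none
toy = pd.DataFrame({
    'species': ['a'] * 8 + ['b'] * 5,
    'sepal_length': [4.9, 6.2, 6.3, 6.4, 6.5, 6.7, 6.9, 7.0,
                     5.0, 5.1, 5.2, 5.3, 5.4],
})
result = outlier_rows(toy, 'species', 'sepal_length')
print(result)
```

The bounds are calculated separately for each group, so a value is only flagged relative to its own species.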

seaborn.catplot

sns.catplot(kind='box', data=df.melt(id_vars='species'), x='value', y='variable', hue='species', aspect=1.5)

[Image: box plots of each measurement, grouped by species]

Trenton McKinney
  • Does filtering on the fliers return all the outliers? Like `df[df["sepal_length"] == 4.9]` – Ailurophile Aug 27 '21 at 04:17
  • What you wrote returns all rows where sepal_length equals 4.9. My answer returns the stats only for sepal_length for the specific species, which includes all fliers. I’m going to bed now and won’t respond until after 9am PST (UTC-7) – Trenton McKinney Aug 27 '21 at 05:04
  • How can I return the rows of the outliers? Please respond when you are available. – Ailurophile Aug 27 '21 at 05:15
  • @Pluviophile this question [how to use pandas filter with IQR](https://stackoverflow.com/q/34782063/7758804) shows how to filter the dataframe. In the example dataframe, there are 3 species, each of which has 4 metrics. The calculation needs to be done for each group, because they all have different metrics. – Trenton McKinney Aug 27 '21 at 11:50
  • I should filter each group and apply IQR and filter the data right? – Ailurophile Aug 27 '21 at 16:23
  • @Pluviophile As far as I can tell that is what needs to happen. If you have additional issues applying those methods to your dataframe, you should open a new question, and be sure to reference the other question. – Trenton McKinney Aug 27 '21 at 16:26