I have a snippet of Python that creates a box plot as follows (works great):

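import matplotlib.pyplot as plt
import seaborn as sns

# group is the left frame, so its key goes in left_on; t's key goes in right_on
merged = group.merge(t, left_on="user", right_on="user_lower", how="left")
g = sns.boxplot(x="Company", y="Total_Activities", data=merged, orient="v")
g.set_xticklabels(g.get_xticklabels(), rotation=90)
plt.show()  # plt.show() does not take the Axes as an argument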

I've read in other posts that labeling the outliers involves iterating over them. Does anyone have an example of this for a merged dataset using Seaborn?

  • Same question as [this one](https://stackoverflow.com/questions/40470175/boxplot-outliers-labels-python). The other at least provides a [mcve]. – ImportanceOfBeingErnest Jan 23 '18 at 17:14
  • Seaborn makes it especially hard to manipulate plots once they are created. Using matplotlib would be easier here, because the matplotlib boxplot function directly returns the fliers, so one can reuse them. Is that an option for you? In any case, you need to provide a [mcve] of the issue and clearly state what kind of label you want; it would also help to state why this issue cannot be solved by other questions, e.g. [this one](https://stackoverflow.com/questions/45354215/matplotlib-boxplot-showing-number-of-occurrences-of-integer-outliers). – ImportanceOfBeingErnest Jan 23 '18 at 17:24
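
To illustrate the matplotlib route suggested in the comment above, here is a minimal sketch. It assumes plain matplotlib is acceptable, and it uses made-up sample data and a made-up label format (the outlier's value). The only thing it relies on is that `Axes.boxplot` returns a dict whose `"fliers"` entry holds one `Line2D` per box.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
# Made-up samples; each gets one value pushed far outside the whiskers.
samples = {
    "A": np.append(rng.normal(10, 1, 50), 25.0),
    "B": np.append(rng.normal(12, 1, 50), 30.0),
}

fig, ax = plt.subplots()
result = ax.boxplot(list(samples.values()))  # boxes are drawn at x = 1, 2, ...
ax.set_xticklabels(list(samples))

# result["fliers"] holds one Line2D per box; its data are the outlier points.
labelled = []
for flier in result["fliers"]:
    for x, y in zip(*flier.get_data()):
        ax.annotate(f"{y:.1f}", (x, y), xytext=(8, 0),
                    textcoords="offset points")
        labelled.append((x, y))
```

Calling `plt.show()` afterwards displays the figure. Because the fliers come back grouped per box, no marker-based filtering of the axes lines is needed.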

1 Answer

I used this workaround to get the x-coordinates of the outliers in the box plot axes, which I could then use to label them as needed. The dataframe index of each outlier is found by selecting outliers with the same rule the seaborn box plot uses: values more than 1.5 × IQR below Q1 or above Q3.

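import matplotlib.pyplot as plt
import seaborn as sns

tips = sns.load_dataset("tips")
ax = sns.boxplot(x="day", y="total_bill", hue="smoker",
                 data=tips, palette="Set3")

# Collect the (x, y) axes coordinates of every outlier (flier) marker.
# This relies on seaborn drawing fliers with the 'd' (diamond) marker,
# which may differ between seaborn versions.
plt_outliers_xy = []
for line in ax.get_lines():
    x_data, y_data = line.get_data()
    if line.get_marker() != 'd' or len(y_data) == 0:
        continue
    for x_val, y_val in zip(x_data, y_data):
        plt_outliers_xy.append((x_val, y_val))

# Group the data the same way the plot does (x variable, then hue).
grp = tips.groupby(['day', 'smoker'])

for name, df in grp:
    print(name)
    y_vals = df["total_bill"]
    Q1 = y_vals.quantile(0.25)
    Q3 = y_vals.quantile(0.75)
    IQR = Q3 - Q1    # IQR is the interquartile range.
    iqr_filter = (y_vals >= Q1 - 1.5 * IQR) & (y_vals <= Q3 + 1.5 * IQR)
    # Rows outside the whiskers; assumed to be in the same order as the plotted fliers.
    dropped = y_vals.loc[~iqr_filter]
    for index, y_i in dropped.items():  # Series.iteritems() was removed in pandas 2.0
        x_plt, y_plt = plt_outliers_xy.pop(0)
        # Sanity check: the difference is 0 when data and plot points are paired correctly.
        print(f"{index} : {y_i:.4f} - {y_plt:.4f} = {y_i-y_plt:.4f}")
#        ax.plot(x_plt, y_plt,'ro')
        ax.annotate(f"{index}", (x_plt, y_plt), (10, 10), textcoords='offset pixels')
    print()

plt.show()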

The outliers per grouped data can be obtained with: https://datascience.stackexchange.com/questions/54808/how-to-remove-outliers-using-box-plot

Or: Extract outliers from Seaborn Boxplot

Or: https://nextjournal.com/schmudde/how-to-remove-outliers-in-data
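
As a rough sketch of what "outliers per grouped data" can look like without touching the plot at all, the helper below applies the 1.5 × IQR rule per group using pandas only. The function name `iqr_outliers` and the sample frame are made up for illustration; the column names are borrowed from the question.

```python
import pandas as pd

def iqr_outliers(df, group_cols, value_col, whis=1.5):
    """Return the rows whose value lies outside the whis * IQR fences of their group.

    Hypothetical helper, not part of pandas or seaborn.
    """
    grouped = df.groupby(group_cols, observed=True)[value_col]
    # transform broadcasts each group's quartile back onto that group's rows
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    iqr = q3 - q1
    mask = (df[value_col] < q1 - whis * iqr) | (df[value_col] > q3 + whis * iqr)
    return df[mask]

# Made-up data: one obvious outlier per company.
data = pd.DataFrame({
    "Company": ["X"] * 6 + ["Y"] * 6,
    "Total_Activities": [1, 2, 3, 4, 5, 100, 10, 11, 12, 13, 14, -50],
})

outliers = iqr_outliers(data, ["Company"], "Total_Activities")
print(outliers)
```

The returned frame keeps the original index, so the index values can serve as labels when annotating the corresponding flier points.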

The plot result: [image: Seaborn box plot with annotated outliers]