As I go through online tutorials and articles, whenever I encounter a plot that uses Seaborn's distplot, I re-create it using either histplot or displot.

I do this because distplot is deprecated and I want to rewrite the code to current standards.

I am going through this article: https://www.kite.com/blog/python/data-analysis-visualization-python/

and there is a section using distplot whose output I cannot replicate.

This is the section of code that I am trying to replicate:

col_names = ['StrengthFactor', 'PriceReg', 'ReleaseYear', 'ItemCount', 'LowUserPrice', 'LowNetPrice']
fig, ax = plt.subplots(len(col_names), figsize=(8, 40))
for i, col_val in enumerate(col_names):
    x = sales_data_hist[col_val][:1000]
    sns.distplot(x, ax=ax[i], rug=True, hist=False)
    outliers = x[percentile_based_outlier(x)]
    ax[i].plot(outliers, np.zeros_like(outliers), 'ro', clip_on=False)

    ax[i].set_title('Outlier detection - {}'.format(col_val), fontsize=10)
    ax[i].set_xlabel(col_val, fontsize=8)

plt.show()

distplot itself is deprecated, and its replacement, displot, does not accept the ax argument used here. For now, the code still runs.

In a nutshell, all I am trying to do is replicate the exact output of the code above (rug plot, the red dots representing the removed values, etc.) without using deprecated code.

I have tried various combinations of displot and histplot but I have been unable to get the exact same output any other way.

JohanC
MarkS
  • It seems like `displot` should be able to replicate this. To save others duplicating your efforts, what have you already tried, and how is it different from the original output? – tmdavison Aug 16 '21 at 14:04
  • The closest I have come is this: ```sns.displot(x, ax=ax[i], rug=True)``` I get the rug, but it's a histogram, not a kde plot, which I don't understand at all since I am using displot. The red dots representing the removed values are gone, and I can't figure out how to get rid of the axis option without completely breaking the output. Histplot has no rug option. – MarkS Aug 16 '21 at 14:42
  • Using the above line, I get this warning, and the output does not match that of the code posted in my question: ```C:\Users\Mark\AppData\Local\Programs\Python\Python38\lib\site-packages\seaborn\distributions.py:2164: UserWarning: `displot` is a figure-level function and does not accept the ax= paramter. You may wish to try histplot. warnings.warn(msg, UserWarning)``` – MarkS Aug 16 '21 at 14:49
  • Yes, the documentation states `displot` is a figure-level function, not an axes-level function, so you can't tell it which axes to plot on. I believe you could, instead, use `kdeplot()` and `rugplot()` to get the result you want (I think that is what `displot` is doing under the hood anyway). Additionally, if you want to keep using `displot`, you can use `kind='kde'` to get a kde plot rather than a histogram. – tmdavison Aug 16 '21 at 16:51

1 Answer

The sns.kdeplot() function draws the same kde curve that distplot shows. (In fact, distplot just calls kdeplot internally.) Similarly, sns.rugplot() draws the rug. Both are axes-level functions, so they accept the ax argument.

Here is an example using the iris dataset, which is easier to reproduce:

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

def percentile_based_outlier(data, threshold=95):
    diff = (100 - threshold) / 2
    minval, maxval = np.percentile(data, [diff, 100 - diff])
    return (data < minval) | (data > maxval)

iris = sns.load_dataset('iris')
col_names = [col for col in iris.columns if iris[col].dtype == 'float64']  # the numerical columns
fig, axs = plt.subplots(len(col_names), figsize=(5, 12))
for ax, col_val in zip(axs, col_names):
    x = iris[col_val]
    sns.kdeplot(x, ax=ax)
    sns.rugplot(x, ax=ax, color='C0')
    outliers = x[percentile_based_outlier(x)]
    ax.plot(outliers, np.zeros_like(outliers), 'ro', clip_on=False)

    ax.set_title(f'Outlier detection - {col_val}', fontsize=10)
    ax.set_xlabel('')  # ax[i].set_xlabel(col_val, fontsize=8)
plt.tight_layout()
plt.show()

[Figure: emulating sns.distplot(hist=False, rug=True)]
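To make clear which points end up as red dots, here is a minimal sketch that runs the same percentile_based_outlier helper on a made-up array (the integers 0 to 100, chosen only for illustration). With the default threshold of 95, everything below the 2.5th percentile or above the 97.5th percentile is flagged.

```python
import numpy as np

def percentile_based_outlier(data, threshold=95):
    # Keep the central `threshold` percent; flag whatever falls outside it.
    diff = (100 - threshold) / 2
    minval, maxval = np.percentile(data, [diff, 100 - diff])
    return (data < minval) | (data > maxval)

# Illustrative data only: the integers 0..100.
values = np.arange(101)
mask = percentile_based_outlier(values)   # boolean array, same length as values
flagged = values[mask]                    # the three smallest and three largest values
print(flagged)
```

The function returns a boolean mask rather than the values themselves, which is why the plotting code indexes with x[percentile_based_outlier(x)].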

To use displot, the dataframe can be converted to "long form" via pd.melt(). The outliers can be added via a custom function called by g.map_dataframe(...):

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

def percentile_based_outlier(data, threshold=95):
    diff = (100 - threshold) / 2
    minval, maxval = np.percentile(data, [diff, 100 - diff])
    return (data < minval) | (data > maxval)

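def show_outliers(data, color=None, **kwargs):
    # g.map_dataframe() calls this for each facet; `data` is that facet's subset.
    # seaborn passes `color`, but it is unused here because the dots are always red.
    x = data['value'].to_numpy()
    outliers = x[percentile_based_outlier(x)]
    plt.plot(outliers, np.zeros_like(outliers), 'ro', clip_on=False)
    plt.xlabel('')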

iris = sns.load_dataset('iris')
col_names = [col for col in iris.columns if iris[col].dtype == 'float64']  # the numerical columns
iris_long = iris.melt(value_vars=col_names)
g = sns.displot(data=iris_long, x='value', kind='kde', rug=True, row='variable',
                height=2.2, aspect=3,
                facet_kws={'sharey': False, 'sharex': False})
g.map_dataframe(show_outliers)
plt.show()

[Figure: displot with outliers]
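For readers unfamiliar with the long-form conversion, here is a small sketch of what melt() does, using a made-up two-row frame (the values and the reduced column set are assumptions for illustration, not the real iris data). Each numeric column is stacked into a 'variable'/'value' pair, which is what lets displot put one variable on each row.

```python
import pandas as pd

# Hypothetical wide-form frame with two numeric columns and one label column.
wide = pd.DataFrame({
    "sepal_length": [5.1, 4.9],
    "petal_length": [1.4, 1.3],
    "species": ["setosa", "setosa"],
})

# Stack only the numeric columns; with no id_vars, the other columns are dropped.
long_df = wide.melt(value_vars=["sepal_length", "petal_length"])
print(long_df)
```

The default column names 'variable' and 'value' are the ones the displot call and show_outliers refer to.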

JohanC