44

I'm plotting some data from various tests. Sometimes in a test I happen to have one outlier (say 0.1), while all other values are three orders of magnitude smaller.

With matplotlib, I plot against the range [0, max_data_value].

How can I just zoom into my data and not display outliers, which would mess up the x-axis in my plot?

Should I simply take the 95th percentile and use the range [0, 95_percentile] on the x-axis?

Ricky Robinson

5 Answers

77

There's no single "best" test for an outlier. Ideally, you should incorporate a-priori information (e.g. "This parameter shouldn't be over x because of blah...").

Most robust tests for outliers use the median absolute deviation rather than a variance-based measure (or a fixed cutoff such as the 95th percentile). Otherwise, the variance/stddev that is calculated will be heavily skewed by the outliers.

Here's a function that implements one of the more common outlier tests.

import numpy as np

def is_outlier(points, thresh=3.5):
    """
    Returns a boolean array with True if points are outliers and False 
    otherwise.

    Parameters:
    -----------
        points : A numobservations by numdimensions array of observations
        thresh : The modified z-score to use as a threshold. Observations with
            a modified z-score (based on the median absolute deviation) greater
            than this value will be classified as outliers.

    Returns:
    --------
        mask : A numobservations-length boolean array.

    References:
    ----------
        Boris Iglewicz and David Hoaglin (1993), "Volume 16: How to Detect and
        Handle Outliers", The ASQC Basic References in Quality Control:
        Statistical Techniques, Edward F. Mykytka, Ph.D., Editor. 
    """
    if points.ndim == 1:
        points = points[:, None]
    median = np.median(points, axis=0)
    diff = np.sum((points - median)**2, axis=-1)
    diff = np.sqrt(diff)
    med_abs_deviation = np.median(diff)

    modified_z_score = 0.6745 * diff / med_abs_deviation

    return modified_z_score > thresh

As an example of using it, you'd do something like the following:

import numpy as np
import matplotlib.pyplot as plt

# The function above... In my case it's in a local utilities module
from sci_utilities import is_outlier

# Generate some data
x = np.random.random(100)

# Append a few "bad" points
x = np.r_[x, -3, -10, 100]

# Keep only the "good" points
# "~" operates as a logical not operator on boolean numpy arrays
filtered = x[~is_outlier(x)]

# Plot the results
fig, (ax1, ax2) = plt.subplots(nrows=2)

ax1.hist(x)
ax1.set_title('Original')

ax2.hist(filtered)
ax2.set_title('Without Outliers')

plt.show()
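
One caveat with the function above: it divides by the median absolute deviation, which is zero when, for example, more than half of the values are identical. Here is a minimal sketch of a guarded variant, assuming a fallback to the smallest non-zero deviation (the workaround reported in the comments on this answer). The name is_outlier_safe is hypothetical, and the fallback is a judgment call, not part of the referenced test.

```python
import numpy as np

def is_outlier_safe(points, thresh=3.5):
    """Modified z-score outlier test that tolerates a zero median absolute deviation."""
    points = np.asarray(points, dtype=float)
    if points.ndim == 1:
        points = points[:, None]
    median = np.median(points, axis=0)
    # Euclidean distance of every observation from the per-dimension median
    diff = np.sqrt(np.sum((points - median) ** 2, axis=-1))
    mad = np.median(diff)
    if mad == 0:
        # Assumption: fall back to the smallest non-zero deviation
        nonzero = diff[diff > 0]
        if nonzero.size == 0:
            # Every point equals the median, so nothing can be an outlier
            return np.zeros(len(diff), dtype=bool)
        mad = nonzero.min()
    # 0.6745 is approximately the 0.75 quantile of the standard normal
    # distribution; it rescales the MAD so that it is comparable to a
    # standard deviation for normally distributed data.
    return 0.6745 * diff / mad > thresh
```

With mostly-zero data such as np.r_[np.zeros(60), 0.001, 0.002, 0.1], this flags only the 0.1 value instead of dividing by zero.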


Joe Kington
  • This is a great answer (+1 from me), but I think '~' is a bitwise not, not a logical not. It seems not to matter here, for reasons I'm not 100% clear about, but in other places it would: `~False != True`, but `not False == True` – Will Dean Nov 13 '12 at 13:24
  • 1
    Good point! In numpy, it's overloaded to operate as logical not on boolean arrays (e.g. `~np.array(False) == True`), but this isn't the case for anything else. I should clarify that. (On a side note, by convention `not some_array` will raise a value error if `some_array` has more than one element. Thus the need for `~` in the example above.) – Joe Kington Nov 14 '12 at 12:58
  • Thanks for the response - I actually tried 'not' and got the error you predict, so I was even more mystified... – Will Dean Nov 14 '12 at 13:45
  • 3
    This breaks when the median deviation is zero. That happened to me when I naively loaded a data set in with more than 50% zeros. – Wesley Tansey Mar 22 '14 at 12:58
  • @WesleyTansey did you find a good solution to deal with the division by 0 errors? I'm currently working through the same problem. – The2ndSon Mar 02 '16 at 23:27
  • I think I ended up just taking the minimum non-zero deviation in that case. It worked well for my edge case. – Wesley Tansey Mar 03 '16 at 00:14
  • I borrowed code from Joe and [this guy](https://edwinth.github.io/blog/outlier-bin/) to make a function that'll plot all your data in a histogram and indicate the outliers in the extreme bins [here](https://stackoverflow.com/a/51050772/8493081). – Benjamin Doughty Jun 26 '18 at 20:25
  • look at this too, removing vertices of convex hull over all the points. specifically useful for scatterplot case: http://www.nbertagnolli.com/jekyll/update/2016/01/30/Visualize_Covariance.html – Ash Jun 30 '18 at 00:37
  • @JoeKington, thanks. Please add some information or tips about the two magic numbers (0.6745 and thresh=3.5). What should they be if someone is using a 90% CI, a 95% CI, or a 75% CI? – user2458922 Jan 24 '23 at 20:21
  • Is it possible to modify the is_outlier() function so that it accepts a 'DataFrameGroupBy' object instead of an array? I tried defining the function for a series and adding series.to_numpy() at the very start of the function, and it works great when I give it a series instead of an array. However, as soon as I try to use it in an aggregate (because I want to detect the outliers grouped by a column), it gives me a "'DataFrameGroupBy' object has no attribute 'to_numpy'" error – Ankhnesmerira Mar 24 '23 at 03:49
13

If you aren't fussed about rejecting outliers as mentioned by Joe and you are doing this purely for aesthetic reasons, you could just set your plot's x-axis limits:

plt.xlim(min_x_data_value, max_x_data_value)

Where the values are your desired limits to display.

plt.ylim(min, max) sets the limits on the y-axis in the same way.
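
For a histogram, setting the limits alone leaves the bins as they were computed over the full data range, so the zoomed view can end up with a single wide bar. Here is a rough sketch of one way around that, assuming you already have a matplotlib Axes object to draw on. The helper names zoomed_bin_edges and plot_zoomed_hist are made up for illustration.

```python
import numpy as np

def zoomed_bin_edges(lower, upper, bins=50):
    """Evenly spaced bin edges covering only the visible range [lower, upper]."""
    return np.linspace(lower, upper, bins + 1)

def plot_zoomed_hist(ax, data, lower, upper, bins=50):
    """Histogram `data` on `ax`, binned and displayed only over [lower, upper].

    `ax` is assumed to be a matplotlib Axes; values outside the range are
    simply not counted in any bin.
    """
    edges = zoomed_bin_edges(lower, upper, bins)
    ax.hist(np.asarray(data, dtype=float), bins=edges)
    ax.set_xlim(lower, upper)
    return edges
```

For example, with fig, ax = plt.subplots(), calling plot_zoomed_hist(ax, x, 0, np.percentile(x, 95)) bins only the lower 95% of the data and zooms the x-axis to match.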

Jdog
  • 5
    For a histogram, though, the OP would also need to recalculate the bins. Matplotlib uses fixed bin edges. It doesn't "rebin" when you zoom in. – Joe Kington Aug 09 '12 at 15:25
12

I think using pandas quantile is useful and much more flexible.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

fig = plt.figure()
ax1 = fig.add_subplot(121)
ax2 = fig.add_subplot(122)

pd_series = pd.Series(np.random.normal(size=300)) 
pd_series_adjusted = pd_series[pd_series.between(pd_series.quantile(.05), pd_series.quantile(.95))] 

ax1.boxplot(pd_series)
ax1.set_title('Original')

ax2.boxplot(pd_series_adjusted)
ax2.set_title('Adjusted')

plt.show()
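
If you would rather not depend on pandas, the same filter can be sketched with NumPy alone. The function name trim_by_quantile is hypothetical. Keep in mind that this kind of trimming always drops a fixed fraction of the data, whether or not those points are really outliers.

```python
import numpy as np

def trim_by_quantile(values, lower_q=0.05, upper_q=0.95):
    """Keep only the values between the given quantiles (bounds inclusive)."""
    values = np.asarray(values, dtype=float)
    lower, upper = np.quantile(values, [lower_q, upper_q])
    return values[(values >= lower) & (values <= upper)]
```

The inclusive bounds mirror the default behaviour of Series.between used above.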


jaga
Zstack
9

I usually pass the data through the function np.clip. If you have some reasonable estimate of the maximum and minimum value of your data, just use that. If you don't have a reasonable estimate, the histogram of the clipped data will show you the size of the tails, and if the outliers are really just outliers, the tails should be small.

What I run is something like this:

import numpy as np
import matplotlib.pyplot as plt

data = np.random.normal(3, size=100000)
plt.hist(np.clip(data, -15, 8), bins=333, density=True)
plt.show()

You can compare the results if you change the min and max in the clipping function until you find the right values for your data.

[Figure: histogram of the clipped data]

In this example, you can see immediately that the max value of 8 is not good because you are removing a lot of meaningful information. The min value of -15 should be fine since the tail is not even visible.

Based on this, you could probably write some code that finds good bounds by keeping the size of the tails within some tolerance.
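
A minimal sketch of that idea, assuming "tolerance" means the fraction of points you are willing to pile up at each clipping bound. It simply uses quantiles for the bounds, which is one possible choice and not the only one. The name clip_bounds and the default tol are made up for illustration.

```python
import numpy as np

def clip_bounds(data, tol=0.001):
    """Return (low, high) so that roughly `tol` of the data lies beyond each bound."""
    data = np.asarray(data, dtype=float)
    low, high = np.quantile(data, [tol, 1.0 - tol])
    return low, high

# Synthetic data, seeded so the result is repeatable
rng = np.random.default_rng(0)
data = rng.normal(3, size=100_000)

low, high = clip_bounds(data)
clipped = np.clip(data, low, high)
```

The clipped array can then be passed to plt.hist as before. A smaller tol gives wider bounds and smaller spikes at the edges of the histogram.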

Jorge E. Cardona
3

In some cases (e.g. in histogram plots such as the one in Joe Kington's answer), rescaling the plot can still show that outliers exist, even though they are partially cropped out by the zoom. Removing the outliers does not have the same effect as rescaling. Automatically finding appropriate axis limits seems generally more desirable and easier than detecting and removing outliers.

Here's an autoscale idea using percentiles and data-dependent margins to achieve a nice view.

import numpy as np
import matplotlib.pyplot as plt    

# xdata = some x data points ...
# ydata = some y data points ...

# Finding limits for y-axis     
ypbot = np.percentile(ydata, 1)
yptop = np.percentile(ydata, 99)
ypad = 0.2*(yptop - ypbot)
ymin = ypbot - ypad
ymax = yptop + ypad

Example usage:

fig = plt.figure(figsize=(6, 8))

ax1 = fig.add_subplot(211)
ax1.scatter(xdata, ydata, s=1, c='blue')
ax1.set_title('Original')
ax1.axhline(y=0, color='black')

ax2 = fig.add_subplot(212)
ax2.scatter(xdata, ydata, s=1, c='blue')
ax2.axhline(y=0, color='black')
ax2.set_title('Autoscaled')
ax2.set_ylim([ymin, ymax])

plt.show()
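
The limit calculation above can also be wrapped in a small reusable helper. This is only a sketch: the function name percentile_limits and its parameter names are hypothetical, with defaults chosen to match the 1st/99th percentiles and the 20% padding used above.

```python
import numpy as np

def percentile_limits(values, lower=1, upper=99, pad_fraction=0.2):
    """Axis limits from two percentiles, widened by a data-dependent margin."""
    values = np.asarray(values, dtype=float)
    bottom, top = np.percentile(values, [lower, upper])
    pad = pad_fraction * (top - bottom)
    return bottom - pad, top + pad
```

It can then be applied to either axis, for example ax2.set_ylim(*percentile_limits(ydata)).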


FNia