64

For the given data, I want to set the outlier values (defined by the 95% confidence level, the 95% quantile function, or whatever is appropriate) to NaN values. Below are my data and the code that I am using right now. I would be glad if someone could explain this to me further.

import numpy as np, matplotlib.pyplot as plt

data = np.random.rand(1000)+5.0

plt.plot(data)
plt.xlabel('observation number')
plt.ylabel('recorded value')
plt.show()
  • 3
    You know your data better, but I would think that Winsorising would be better than deletion. Moreover, if you set those data as nan, you then have to handle that. Take a look at the [np.percentile](http://docs.scipy.org/doc/numpy-dev/reference/generated/numpy.percentile.html) function. – Martin Mar 12 '14 at 14:18
  • FYI: [Detect and exclude outliers in Pandas dataframe](https://stackoverflow.com/q/23199796/395857) – Franck Dernoncourt Aug 18 '17 at 01:16

5 Answers

153

The problem with using percentiles is that the points identified as outliers are a function of your sample size.

There are a huge number of ways to test for outliers, and you should give some thought to how you classify them. Ideally, you should use a-priori information (e.g. "anything above/below this value is unrealistic because...")

However, a common, not-too-unreasonable outlier test is to remove points based on their "median absolute deviation".

Here's an implementation for the N-dimensional case (from some code for a paper here: https://github.com/joferkington/oost_paper_code/blob/master/utilities.py):

import numpy as np

def is_outlier(points, thresh=3.5):
    """
    Returns a boolean array with True if points are outliers and False 
    otherwise.

    Parameters:
    -----------
        points : A numobservations-by-numdimensions array of observations
        thresh : The modified z-score to use as a threshold. Observations with
            a modified z-score (based on the median absolute deviation) greater
            than this value will be classified as outliers.

    Returns:
    --------
        mask : A numobservations-length boolean array.

    References:
    ----------
        Boris Iglewicz and David Hoaglin (1993), "Volume 16: How to Detect and
        Handle Outliers", The ASQC Basic References in Quality Control:
        Statistical Techniques, Edward F. Mykytka, Ph.D., Editor. 
    """
    if len(points.shape) == 1:
        points = points[:,None]
    median = np.median(points, axis=0)
    diff = np.sum((points - median)**2, axis=-1)
    diff = np.sqrt(diff)
    med_abs_deviation = np.median(diff)

    modified_z_score = 0.6745 * diff / med_abs_deviation

    return modified_z_score > thresh
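
To tie this back to the question, here is a minimal usage sketch (assuming `data` is the 1-D float array from the question) that masks the flagged points as NaN:

import numpy as np

data = np.random.rand(1000) + 5.0

# is_outlier returns a boolean mask; assigning NaN requires a float array
data[is_outlier(data)] = np.nan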

This is very similar to one of my previous answers, but I wanted to illustrate the sample size effect in detail.

Let's compare a percentile-based outlier test (similar to @CTZhu's answer) with a median-absolute-deviation (MAD) test for a variety of different sample sizes:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

def main():
    for num in [10, 50, 100, 1000]:
        # Generate some data
        x = np.random.normal(0, 0.5, num-3)

        # Add three outliers...
        x = np.r_[x, -3, -10, 12]
        plot(x)

    plt.show()

def mad_based_outlier(points, thresh=3.5):
    if len(points.shape) == 1:
        points = points[:,None]
    median = np.median(points, axis=0)
    diff = np.sum((points - median)**2, axis=-1)
    diff = np.sqrt(diff)
    med_abs_deviation = np.median(diff)

    modified_z_score = 0.6745 * diff / med_abs_deviation

    return modified_z_score > thresh

def percentile_based_outlier(data, threshold=95):
    diff = (100 - threshold) / 2.0
    minval, maxval = np.percentile(data, [diff, 100 - diff])
    return (data < minval) | (data > maxval)

def plot(x):
    fig, axes = plt.subplots(nrows=2)
    for ax, func in zip(axes, [percentile_based_outlier, mad_based_outlier]):
        sns.distplot(x, ax=ax, rug=True, hist=False)
        outliers = x[func(x)]
        ax.plot(outliers, np.zeros_like(outliers), 'ro', clip_on=False)

    kwargs = dict(y=0.95, x=0.05, ha='left', va='top')
    axes[0].set_title('Percentile-based Outliers', **kwargs)
    axes[1].set_title('MAD-based Outliers', **kwargs)
    fig.suptitle('Comparing Outlier Tests with n={}'.format(len(x)), size=14)

main()

[Figures: for n = 10, 50, 100, and 1000, two-panel plots ("Percentile-based Outliers" on top, "MAD-based Outliers" below) showing the kernel density estimate with a rug plot and the flagged points marked as red dots on the x-axis; each figure is titled "Comparing Outlier Tests with n=...".]

Notice that the MAD-based classifier works correctly regardless of sample size, while the percentile-based classifier flags more points as the sample size increases, regardless of whether or not they are actually outliers.
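
To see the same effect numerically rather than graphically, you can reuse the two functions above and count how many points each test flags as the sample size grows (a quick sketch using the same synthetic data as in the plots):

for num in [10, 50, 100, 1000]:
    x = np.r_[np.random.normal(0, 0.5, num - 3), -3, -10, 12]
    n_pct = percentile_based_outlier(x).sum()
    n_mad = mad_based_outlier(x).sum()
    # the percentile test flags roughly 5% of the points no matter what,
    # while the MAD test typically flags just the three injected outliers
    print(num, n_pct, n_mad)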

Joe Kington
  • 3
    Joe, +1, this is a great answer. Although I am wondering if the OP's data is always uniformly distributed (`random.rand()`), or may follow some other distribution most of the time. If the data is always uniformly distributed, I am not sure about using `MAD`. – CT Zhu Mar 12 '14 at 16:53
  • 3
    @CTZhu - Good point, especially if the OP's data is, say, log-normally distributed. For vaguely symmetric distributions, deviations from a normal distribution shouldn't matter too much, but for strongly asymmetric distributions such as the lognormal, MAD isn't a good choice. (Though you could always apply it in log-space to get around that.) All of this just serves to underscore the point that you should put some thought into whatever outlier test you choose. – Joe Kington Mar 12 '14 at 17:58
  • 6
    @JoeKington every where you are using median, but `diff` is calculated as L2 norm ( `**2` ); median is the value which minimizes L1 norm, whereas in L2 norm 'mean' is the center; i was expecting that if you start with median stay in L1 norm. do you have any reason `**2` would work better than absolute values in calculating `diff`? – behzad.nouri Mar 12 '14 at 21:15
  • @behzad.nouri - Now that you mention it, it does seem very odd to use the L2 norm for this (and it's not in fact the median absolute deviation, then). If I recall correctly, I loosely based it on a couple of implementations in various other languages that did the same thing. I'll need to take another look back at the original reference. If I just change things to the L1 norm, it doesn't work quite correctly, at any rate... – Joe Kington Mar 13 '14 at 02:56
  • @Joe Kington Thank you so much indeed. Could you please explain whether I can use the method provided in the following link for the detection of outliers? http://stats.stackexchange.com/questions/28593/mahalanobis-distance-distribution-of-multivariate-normally-distributed-points This is just for comparison purposes. –  Mar 14 '14 at 11:23
  • 1
    In your MAD-based outlier detection, how did you set the values 0.6745 and 3.5, and for what purpose? How do you determine those values? It's very confusing. –  Apr 02 '14 at 08:07
  • @julie - You'll have to see the reference. They're derived there. Basically the 0.6745 is to make the values roughly equivalent in units to standard deviations and the 3.5 is the recommended threshold (roughly equivalent to 3.5 standard deviations). Sorry I haven't been replying to much lately! – Joe Kington Apr 02 '14 at 11:53
  • @julie - I only have a hardcopy of the reference, but there's a good discussion here: http://habcam.whoi.edu/HabCamData/HAB/processed/Outlier%20Methods_external.pdf – Joe Kington Apr 02 '14 at 11:56
  • So these values should be derived from our data, rather than assigning 0.6745 and 3.5 for any data, right? –  Apr 02 '14 at 13:53
  • Well, the 0.6745 should be constant, regardless of the dataset. You may want to adjust the 3.5 depending on how close or far you expect the outliers to be to the rest of the data. – Joe Kington Apr 02 '14 at 14:29
  • 1
    The modified Z-value calculated in function `mad_based_outlier` looks different from the original one in [nist.gov](http://itl.nist.gov/div898/handbook/eda/section3/eda35h.htm). The `med_abs_deviation` calculated in the code is different from the definition of the mean absolute deviation. – Duke May 08 '14 at 09:19
  • The comparative plots are VERY helpful for getting a feel for how at least these two methods differ. Thank you for going through the effort to create them. – Gordon Bean Apr 15 '15 at 18:10
  • 1
    There is something I don't understand in is_outlier. diff is a single value (because of np.sum) - then you use np.median on diff -- is that a typo? Why would you calculate the np.median of a single value? Am I missing something here? – AJed Mar 29 '16 at 11:21
  • @AJed - Notice the `axis` kwarg to `np.sum` - It's not returning a single value, it's returning the sum along the last axis, which is an array. For example: `np.sum(np.ones((3, 4)), axis=-1)` yields `array([4, 4, 4])`. – Joe Kington Mar 29 '16 at 13:53
  • @JoeKington - My bad... I passed an (L,) numpy array instead of (L, 1). I didn't read the docs. Thanks. – AJed Mar 30 '16 at 00:29
  • This answer (and your code) fails if more than half of the items in the set have the same value, since the median(points - median(points)) will be zero, dividing by zero will result in all np.inf. – cgnorthcutt Apr 14 '16 at 04:15
  • 2
    Alternative mirror for @JoeKington 's PDF paper http://www.pdf-archive.com/2016/07/29/outlier-methods-external/outlier-methods-external.pdf – Tiago Jul 29 '16 at 19:18
  • 1
    @JoeKington I am having trouble getting your MAD implementation to match up with MAD from `statsmodel.robust.mad` [here](http://www.statsmodels.org/stable/generated/statsmodels.robust.scale.mad.html?highlight=mad#statsmodels.robust.scale.mad). Are you aware of any discrepancy? – Kambiz Mar 08 '17 at 22:57
  • Great answer, but where does the value 0.6745 come from? – silent_dev Mar 20 '17 at 04:25
  • 1
    @JoeKington Could you please explain why you're using `np.sqrt(diff)` before computing the median of `diff`? – Kurt Bourbaki Aug 10 '17 at 15:30
  • 1
    @KurtBourbaki I figured out why. He is basically trying to do `abs(points - median)`, but he over-complicated things! – Yahya May 17 '19 at 22:59
  • 1
    @Yahya Agree - this post has an excellent explanation but the code is, unfortunately, inefficient and over-complicated. It raises all the differences to the power of 2 and then takes a sqrt from each. Why would we do this if we can simply take the absolute values? – Nick Feb 24 '20 at 19:01
18

Detection of outliers in one-dimensional data depends on its distribution.

1 - Normal Distribution:

  1. Data values are almost equally distributed over the expected range: in this case you can easily use all the methods that involve the mean, such as a confidence interval of 2 or 3 standard deviations (95% or 99.7%) for normally distributed data (central limit theorem and sampling distribution of the sample mean). It is a highly effective method, explained in the Khan Academy Statistics and Probability sampling-distributions material (a minimal sketch of this rule follows this list).

One other way is a prediction interval, if you want an interval for the data points rather than for the mean.

  2. Data values are randomly distributed over a range: the mean may not be a fair representation of the data, because the average is easily influenced by outliers (very small or large values in the data set that are not typical). The median is another way to measure the center of a numerical data set.

    Median absolute deviation - a method which measures the distance of all points from the median in terms of the median distance. http://www.itl.nist.gov/div898/handbook/eda/section3/eda35h.htm has a good explanation, as used in Joe Kington's answer above.
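
A minimal sketch of the standard-deviation rule from point 1 (my illustration, not part of the original answer; the helper name and the 3-sigma cutoff are assumptions, and the data is assumed to be roughly normal):

import numpy as np

def mask_outliers_zscore(data, n_std=3.0):
    # hypothetical helper: set points more than n_std standard deviations
    # from the mean to NaN (only sensible for roughly normal data)
    data = np.asarray(data, dtype=float).copy()
    z = np.abs(data - data.mean()) / data.std()
    data[z > n_std] = np.nan
    return data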

2 - Symmetric Distribution: again, median absolute deviation is a good method if the z-score calculation and threshold are changed accordingly.

Explanation : http://eurekastatistics.com/using-the-median-absolute-deviation-to-find-outliers/

3 - Asymmetric Distribution: Double MAD - double median absolute deviation, explained in the link attached above.

Attaching my Python code for reference:

import sys
import numpy as np

def is_outlier_doubleMAD(points):
    """
    FOR ASYMMETRIC DISTRIBUTIONS
    Returns : filtered array excluding the outliers

    Parameters : the actual data points array

    Calculates the median to divide the data into 2 halves (skew conditions handled).
    Then those two halves are treated as separate data sets, with the same calculation
    as for the symmetric distribution (first answer).
    The only difference is that the thresholds are now the median distances of the
    right and left medians from the overall data median.
    """

    if len(points.shape) == 1:
        points = points[:, None]
    median = np.median(points, axis=0)
    medianIndex = points.size // 2

    leftData = np.copy(points[0:medianIndex])
    rightData = np.copy(points[medianIndex:points.size])

    median1 = np.median(leftData, axis=0)
    diff1 = np.sum((leftData - median1)**2, axis=-1)
    diff1 = np.sqrt(diff1)

    median2 = np.median(rightData, axis=0)
    diff2 = np.sum((rightData - median2)**2, axis=-1)
    diff2 = np.sqrt(diff2)

    med_abs_deviation1 = max(np.median(diff1), 0.000001)
    med_abs_deviation2 = max(np.median(diff2), 0.000001)

    threshold1 = ((median - median1) / med_abs_deviation1) * 3
    threshold2 = ((median2 - median) / med_abs_deviation2) * 3

    # if either threshold is 0 -> no outliers on that side
    if threshold1 == 0:
        threshold1 = sys.maxsize
    if threshold2 == 0:
        threshold2 = sys.maxsize
    # multiplied by a factor so that only the outermost points are removed
    modified_z_score1 = 0.6745 * diff1 / med_abs_deviation1
    modified_z_score2 = 0.6745 * diff2 / med_abs_deviation2

    filtered1 = []
    i = 0
    for data in modified_z_score1:
        if data < threshold1:
            filtered1.append(leftData[i])
        i += 1
    i = 0
    filtered2 = []
    for data in modified_z_score2:
        if data < threshold2:
            filtered2.append(rightData[i])
        i += 1

    filtered = filtered1 + filtered2
    return filtered
shivangi dhakad
  • 2
    In Python 3, it should be `medianIndex = int(points.size/2)`. Also, if I run the code and set a threshold to zero, it crashes with the message `name 'sys' is not defined`. Lastly, the `self` in the function call never gets used. – Eulenfuchswiesel Apr 29 '19 at 13:56
  • you can also use: medianIndex = points.size//2 to avoid float value – The AG Aug 06 '20 at 19:19
16

I've adapted the code from http://eurekastatistics.com/using-the-median-absolute-deviation-to-find-outliers and it gives the same results as Joe Kington's, but uses L1 distance instead of L2 distance, and has support for asymmetric distributions. The original R code did not have Joe's 0.6745 multiplier, so I also added that in for consistency within this thread. Not 100% sure if it's necessary, but makes the comparison apples-to-apples.

import numpy as np

def doubleMADsfromMedian(y, thresh=3.5):
    # warning: this function does not check for NAs
    # nor does it address issues when 
    # more than 50% of your data have identical values
    m = np.median(y)
    abs_dev = np.abs(y - m)
    left_mad = np.median(abs_dev[y <= m])
    right_mad = np.median(abs_dev[y >= m])
    y_mad = left_mad * np.ones(len(y))
    y_mad[y > m] = right_mad
    modified_z_score = 0.6745 * abs_dev / y_mad
    modified_z_score[y == m] = 0
    return modified_z_score > thresh
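
As a usage sketch (the lognormal sample here is my own illustrative assumption, not part of the original answer), applying it to skewed data and setting the flagged points to NaN, as the question asks:

import numpy as np

y = np.random.lognormal(mean=0.0, sigma=1.0, size=1000)

# boolean mask from the function above; NaN-out the asymmetric outliers
y[doubleMADsfromMedian(y)] = np.nan
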
sergeyf
  • How to use MAD based approach on multivariate data? The article you mentioned is great but works on single dimensional data I guess. I would like to know the simplest way to modify it to make it applicable for multivariate data as well. – exAres Nov 18 '15 at 20:12
  • 1
    There's no easy way to do this for multivariate data. A simple way is to just apply the method to one variable at a time and see if some samples are outliers in any dimensions. – sergeyf Nov 19 '15 at 22:05
  • @sergeyf How do we choose the threshold ? Read through the original post , couldn't dig it there as well . – ekta Feb 02 '16 at 10:20
  • @ekta the usual and unfortunate answer is try a bunch of different thresholds with your actual data and see what gets left out. Since this is 1-D you can do visualizations like the ones in Joe Kington's answer. If it makes it easier you can think of the threshold as sort of like the number of standard deviations. So 3.5 is a lot. But I've used numbers closer to 6 before - just depends on your data. – sergeyf Feb 09 '16 at 17:49
  • I think you should replace `y_mad[y < m]` with `y_mad[y <= m]`, and `y_mad[y > m]` with `y_mad[y >= m]`, otherwise when **y equals m**, **y_mad** will be zero. – Jin Aug 08 '16 at 12:32
  • Thanks @Jin - I've fixed it! – sergeyf Aug 08 '16 at 18:36
  • There's a problem with this algo. I'm testing with the set 11724, 21513, 20536, 16386, 17586, 11736, 11212, 7861, 8515, 9046, 7016, 90. It won't return True if the outlier is either the first or last element of the series. – gwthm.in Dec 19 '16 at 11:08
  • @7H3IN5ID3R In your example, the median is about 11k, and 20k is as far from 11k as 90 is, so I am not sure what the outlier should actually be here. These are all the sorted actual absolute deviations from median: `[256.0, 256.0, 268.0, 2422.0, 2953.0, 3607.0, 4452.0, 4918.0, 6118.0, 9068.0, 10045.0, 11378.0]`. Which of these are outliers in your thought? – sergeyf Dec 21 '16 at 14:17
  • I don't think there is an outlier in that series. I'm really not sure about it. I thought an outlier is something which deviates from the series. My problem is, I have to find either a sudden drop or rise in a data series. – gwthm.in Dec 21 '16 at 15:34
  • @sergeyf Shouldn't `modified_z_score = 0.6745 * abs_dev / y_mad` be `0.6745*(y - m)/y_mad`? Because the actual formula is 0.6745*(mean(x) - x)/MAD, so I don't think we should multiply abs_dev with 0.6745. – The AG Sep 23 '20 at 21:18
  • @TheAG why "0.6745*(y - m)/y_mad"? the numerator is currently "abs_dev = np.abs(y - m)". Your suggestion is to remove the "abs"? – sergeyf Sep 24 '20 at 22:21
  • yes buddy.. As per formula, we take the difference of Mi = 0.6745*(Yi - Ymedian) / MAD... One of the article: https://support.infogix.com/hc/en-us/community/posts/360028941133-Outlier-Detection-Using-Modifed-Z-Score – The AG Sep 25 '20 at 14:31
  • 1
    @TheAG Ah I see what you're saying. The reason I made it absolute is that we don't care if the outlier is in the right tail or the left tail. But if you care, then removing the absolute value makes sense! – sergeyf Sep 26 '20 at 17:16
  • gotcha..!! Thank you @sergeyf – The AG Sep 28 '20 at 17:48
4

Well, a simple solution can also be to remove anything outside 2 standard deviations (or 1.96):

import random

def outliers(tmp):
    """tmp is a list of numbers"""
    mean = sum(tmp) / (1.0 * len(tmp))
    var = sum((x - mean)**2 for x in tmp) / (1.0 * len(tmp))
    std = var**0.5
    # keep values farther than 1.96 standard deviations from the mean
    outs = [x for x in tmp if abs(x - mean) > 1.96 * std]
    return outs


lst = [random.randrange(-10, 55) for _ in range(40)]
print(lst)
print(outliers(lst))
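
For comparison, here is a vectorized NumPy sketch of the same 1.96-sigma rule (the function name is mine; `np.std` with the default `ddof=0` matches the population variance used in the loop above):

import numpy as np

def outliers_np(values, n_std=1.96):
    values = np.asarray(values, dtype=float)
    # points farther than n_std standard deviations from the mean
    return values[np.abs(values - values.mean()) > n_std * values.std()]
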
jimseeve
3

Use np.percentile as @Martin suggested:

percentiles = np.percentile(data, [2.5, 97.5])

# data within the 95% range (use >=, <= to keep the boundary values)
data[(percentiles[0] < data) & (percentiles[1] > data)]

# set the outliers to np.nan
data[(percentiles[0] > data) | (percentiles[1] < data)] = np.nan
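
One extra note (my addition, not from the original answer): if `data` already contains NaN values from earlier cleaning, `np.percentile` returns NaN, while `np.nanpercentile` ignores the NaNs:

# same masking as above, but robust to pre-existing NaNs in data
percentiles = np.nanpercentile(data, [2.5, 97.5])
data[(data < percentiles[0]) | (data > percentiles[1])] = np.nan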
CT Zhu
  • 1
    Using percentiles of the data as an outlier test is a reasonable first pass, but it's not ideal. The problem is 1) that you'll remove some data, even if it's not an outlier, and 2) the outliers heavily influence the variance, and therefore the percentile values. The most common outlier tests use "median absolute deviation" which is less sensitive to the presence of outliers. – Joe Kington Mar 12 '14 at 15:03
  • @Joe Kington I would be grateful if you could implement your way using python code. –  Mar 12 '14 at 15:13
  • @Joe Kington I saw the link you answered. However, isn't there an easier way to do it, primarily using the functions available in numpy? –  Mar 12 '14 at 15:27
  • 1
    @julie - That function extensively uses numpy (it requires a numpy array as input and outputs a numpy array). An outlier test is _far_ beyond the scope of `numpy`. (`numpy` itself just contains the core data structure and a few basic operations. It's deliberately small.) You can make an argument that `scipy.stats` would be a reasonable place for an outlier test, but there are many of them, and there's no single best test. Therefore, there isn't currently a single-function outlier test. – Joe Kington Mar 12 '14 at 15:30
  • 1
    Statsmodels has a median-absolute deviation function in `sm.robust.mad`. I'm not sure there are facilities for univariate outlier tests, but there are for influence/outliers in a regression framework. Will see about adding some tools for univariate outlier detection. – jseabold Mar 12 '14 at 16:17
  • @jseabold - Didn't know that was there! Thanks! – Joe Kington Mar 12 '14 at 16:26