
Is there a numpy builtin to do something like the following? That is, take a list d and return a list filtered_d with any outlying elements removed based on some assumed distribution of the points in d.

import numpy as np

def reject_outliers(data):
    m = 2
    u = np.mean(data)
    s = np.std(data)
    filtered = [e for e in data if (u - m * s < e < u + m * s)]
    return filtered

>>> d = [2,4,5,1,6,5,40]
>>> filtered_d = reject_outliers(d)
>>> print(filtered_d)
[2,4,5,1,6,5]

I say 'something like' because the function might allow for varying distributions (Poisson, Gaussian, etc.) and varying outlier thresholds within those distributions (like the m I've used here).

aaren
  • Related: [Can scipy.stats identify and mask obvious outliers?](http://stackoverflow.com/questions/10231206/can-scipy-stats-identify-and-mask-obvious-outliers), though that question seems to deal with more complex situations. For the simple task you described, an external package seems to be overkill. – Sven Marnach Jul 27 '12 at 11:22
  • I was thinking that given the number of builtins in the main numpy library it was strange that there was nothing to do this. It seems like quite a common thing to do with raw, noisy data. – aaren Jul 27 '12 at 12:10
  • Outliers in linear data can be found with the `numpy std` function; however, if the data is non-linear, for example a parabola or a cubic function, the `standard deviation` will not handle the task well, since regression is needed to help work out the outliers. – Weilory Nov 20 '20 at 06:31
  • That's why I coded this repo: [outliers.py](https://github.com/Weilory/python-outliers) – Weilory Nov 20 '20 at 06:32

15 Answers


Something important when dealing with outliers is that one should try to use estimators that are as robust as possible. The mean of a distribution will be biased by outliers, but e.g. the median will be much less so.
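
A quick illustration of this with the question's own data (a minimal sketch):

import numpy as np

d = np.array([2, 4, 5, 1, 6, 5, 40])

# The single outlier (40) drags the mean far from the bulk of the data,
# while the median barely moves.
print(np.mean(d), np.median(d))            # 9.0  5.0
print(np.mean(d[:-1]), np.median(d[:-1]))  # ~3.83  4.5  (same data without the 40)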

Building on eumiro's answer:

def reject_outliers(data, m=2.):
    d = np.abs(data - np.median(data))           # absolute deviation from the median
    mdev = np.median(d)                          # median absolute deviation (MAD)
    s = d / mdev if mdev else np.zeros(len(d))   # scaled deviations (guard against MAD == 0)
    return data[s < m]

Here I have replaced the mean with the more robust median and the standard deviation with the median absolute deviation from the median. I then scaled the distances by their (again) median value so that m is on a reasonable relative scale.

Note that for the data[s<m] syntax to work, data must be a numpy array.
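
For example, a minimal usage sketch with the question's list (on such a tiny sample the default m=2 is quite aggressive; the choice of m is discussed in the comments below):

import numpy as np

d = [2, 4, 5, 1, 6, 5, 40]
arr = np.asarray(d)               # plain lists don't support the boolean indexing used above
print(reject_outliers(arr))       # [4 5 6 5]  (with m=2, the 1 and 2 are also dropped here)
print(reject_outliers(arr, m=5))  # [2 4 5 1 6 5]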

Benjamin Bannier
  • http://www.itl.nist.gov/div898/handbook/eda/section3/eda35h.htm this is basically the modified Z-score referenced here, but with a different threshold. If my math is right, they recommend an m of `3.5 / .6745 ~= 5.189` (they multiply `s` by .6745 and specify an `m` of 3.5...also take `abs(s)`). Can anybody explain the choice of m? Or is it something you'll identify from your particular dataset? – Charlie G Apr 05 '17 at 21:53
  • @Charlie: The choice for `m` depends on the input data set. Its value ultimately decides the purity of the filtered data and the efficiency of rejecting true outliers. One way to determine a good value would be to examine the interplay of purity and efficiency from simulated (training) data, and pick some value. – Benjamin Bannier May 27 '17 at 09:56
  • @BenjaminBannier: Can you please provide some concrete explanation for choosing a value for `m` rather than fluffy statements like "interplay of purity and efficiency"? – stackoverflowuser2010 Jun 27 '17 at 20:26
  • @stackoverflowuser2010: Like I said, this depends on your specific requirements, i.e., how clean we need the signal sample to be (false positives), or how many signal measurements we can afford to throw away to keep the signal clean (false negatives). As for a specific example evaluation for a certain use case, see e.g., http://www.desy.de/~blist/notes/whyeffpur.ps.gz. – Benjamin Bannier Jun 28 '17 at 11:57
  • I'm looking for the Iglewicz and Hoaglin paper, but haven't found it yet. Is there a reason given for why the 0.6745 factor is hardcoded into the modified Z-score provided by NIST? It does seem odd to include a scaling factor on one side of the test and then an arbitrary threshold of 3.5 on the other. That scaling factor must have a meaning, otherwise it would have been as easy to drop it and suggest values greater than 5 are outliers. – Omegaman Aug 24 '17 at 16:53
  • this gives me a 'list index out of range' – john k Sep 16 '17 at 22:28
  • I get the following error when I call the function with a list of floats: `TypeError: only integer scalar arrays can be converted to a scalar index` – Vasilis Mar 19 '18 at 02:29
  • @Charlie, if you look at the figure https://www.itl.nist.gov/div898/handbook/eda/section3/eda356.htm#MAD , you will see that when dealing with a normal distribution (which is in fact not the case where you would need the modified z-scores) with SD = 1, you have MAD ~ 0.68, which explains the scaling factor. The choice of m = 3.5 therefore implies that you want to discard 0.05 % of the data. – Fato39 Apr 26 '18 at 13:45
  • @Fato39 so m can be thought of as a discarding decision parameter? Neat! – Charlie G Apr 28 '18 at 00:27
  • This has given me much better results than the accepted answer. I've added this section to enable receiving an arbitrary list: ```def reject_outliers(data, m = 2.): if not isinstance(data, np.ndarray): data = np.array(data) d = np.abs(data - np.median(data)) mdev = np.median(d) s = d/mdev if mdev else 0. return data[s – gkpln3 Sep 26 '18 at 12:49
  • I want to remove x precent of my data. How should I set m=...? I dont get it. – sqp_125 Nov 21 '19 at 13:24
  • @sqp_125 For removing x percent of your data, take a look at [numpy.quantile](https://numpy.org/devdocs/reference/generated/numpy.quantile.html). To filter out the top 10% of your data: `data_filt = data[data < np.quantile(data, 0.90)]`. Or to get the middle 50% of your data: `data_filt = data[(data >= np.quantile(data, 0.25)) * (data < np.quantile(data, 0.75))]` – Nathaniel Jones May 03 '21 at 22:47
  • Can someone explain in simple terms how this function determines outliers? Something like: "it is removing values that are `m` standard deviations away from the mean." I know that is not literally what it's doing but I'm looking for an explanation in those terms. – Hefe Nov 04 '22 at 04:09

This method is almost identical to yours, just more numpyish (also working on numpy arrays only):

def reject_outliers(data, m=2):
    return data[abs(data - np.mean(data)) < m * np.std(data)]
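
For example, with the question's data as an array (a minimal sketch):

d = np.array([2, 4, 5, 1, 6, 5, 40])
print(reject_outliers(d))  # [2 4 5 1 6 5]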
eumiro
  • That method works well enough if `m` is sufficiently large (e.g. `m=6`), but for small values of `m` this suffers from the mean and the variance not being robust estimators. – Benjamin Bannier May 15 '13 at 09:53
  • that isn't really a complaint about the method though, but a complaint about the vague notion of an 'outlier' – Eelco Hoogendoorn Aug 15 '14 at 08:48
  • how do you choose an m? – john k Sep 16 '17 at 22:10
  • I have not gotten this to work. I keep getting an error: `return data[abs(data - np.mean(data)) < m * np.std(data)] TypeError: only integer scalar arrays can be converted to a scalar index` OR it just freezes my program – john k Sep 16 '17 at 22:26
  • @johnktejik data arg needs to be a numpy array. – Sander van Leeuwen Dec 05 '17 at 15:46
  • @eumiro I think you need <= to handle the case where all data have the same value. Adding a small epsilon on the right side of the equation should make it more robust for similar cases too. Thanks for this clean, simple solution. – lorenzo May 04 '18 at 12:55
  • This doesn't work for the numbers [1, 56, 44, 68, 12, 22, 35, 5, 1000, 1020]. It leaves 1000 in the data. – Cybernetic Nov 20 '20 at 05:53
  • although this works, it's way faster to use `mu = np.mean(data)`, `s = np.std(data)`, `data[np.logical_and(data < mu + m * s, data > mu - m * s)]` – Hashemi Emad Jun 01 '21 at 13:01

Benjamin Bannier's answer yields a pass-through when the median of distances from the median is 0, so I found this modified version a bit more helpful for cases as given in the example below.

def reject_outliers_2(data, m=2.):
    d = np.abs(data - np.median(data))
    mdev = np.median(d)
    s = d / (mdev if mdev else 1.)
    return data[s < m]

Example:

data_points = np.array([10, 10, 10, 17, 10, 10])
print(reject_outliers(data_points))
print(reject_outliers_2(data_points))

Gives:

[[10, 10, 10, 17, 10, 10]]  # 17 is not filtered (mdev is 0, so every point passes)
[10, 10, 10, 10, 10]  # 17 is filtered (its scaled deviation, 7, is greater than m)
Egal
  • What do you mean it doesn't work for a simple 1-d array, Oleg? – Egal Jul 13 '23 at 13:01
  • Well, I put it very simple and concise. Can't share my code due to the NDA, but this solution does not give me the expected answer. I removed my answer since it may be misleading, maybe it does work for others in their cases. The solution below from user ankostis did the trick. – Олег Місько Jul 14 '23 at 20:32

Building on Benjamin's, using pandas.Series, and replacing MAD with IQR:

def reject_outliers(sr, iq_range=0.5):
    pcnt = (1 - iq_range) / 2
    qlow, median, qhigh = sr.dropna().quantile([pcnt, 0.50, 1-pcnt])
    iqr = qhigh - qlow
    return sr[ (sr - median).abs() <= iqr]

For instance, if you set iq_range=0.6, the percentiles of the interquartile range would become 0.20 <--> 0.80, so the accepted band is wider and more of the borderline outliers will be included in the result.
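
A minimal usage sketch with the question's data as a pandas Series (the exact result depends on the tiny sample and pandas' default linear quantile interpolation):

import pandas as pd

sr = pd.Series([2, 4, 5, 1, 6, 5, 40])
print(reject_outliers(sr).tolist())                # [4, 5, 6, 5]
print(reject_outliers(sr, iq_range=0.6).tolist())  # [2, 4, 5, 6, 5]  (wider band, more points kept)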

ankostis

An alternative is to make a robust estimation of the standard deviation (assuming Gaussian statistics). Looking up online calculators, I see that the 90th percentile corresponds to 1.2815σ and the 95th to 1.645σ (http://vassarstats.net/tabs.html?#z).

As a simple example:

import numpy as np

# Create some random numbers
x = np.random.normal(5, 2, 1000)

# Calculate the statistics
print("Mean= ", np.mean(x))
print("Median= ", np.median(x))
print("Max/Min=", x.max(), " ", x.min())
print("StdDev=", np.std(x))
print("90th Percentile", np.percentile(x, 90))

# Add a few large points
x[10] += 1000
x[20] += 2000
x[30] += 1500

# Recalculate the statistics
print()
print("Mean= ", np.mean(x))
print("Median= ", np.median(x))
print("Max/Min=", x.max(), " ", x.min())
print("StdDev=", np.std(x))
print("90th Percentile", np.percentile(x, 90))

# Measure the percentile intervals and then estimate the standard deviation of the
# distribution, both from the median to the 90th percentile and from the 10th to the 90th percentile
p90 = np.percentile(x, 90)
p10 = np.percentile(x, 10)
p50 = np.median(x)
# p50 to p90 is 1.2815 sigma
rSig = (p90-p50)/1.2815
print("Robust Sigma=", rSig)

rSig = (p90-p10)/(2*1.2815)
print("Robust Sigma=", rSig)

The output I get is:

Mean=  4.99760520022
Median=  4.95395274981
Max/Min= 11.1226494654   -2.15388472011
StdDev= 1.976629928
90th Percentile 7.52065379649

Mean=  9.64760520022
Median=  4.95667658782
Max/Min= 2205.43861943   -2.15388472011
StdDev= 88.6263902244
90th Percentile 7.60646688694

Robust Sigma= 2.06772555531
Robust Sigma= 1.99878292462

Which is close to the expected value of 2.

If we want to remove points above/below 5 standard deviations (with 1000 points we would expect 1 value > 3 standard deviations):

y = x[abs(x - p50) < rSig*5]

# Print the statistics again
print("Mean= ", np.mean(y))
print("Median= ", np.median(y))
print("Max/Min=", y.max(), " ", y.min())
print("StdDev=", np.std(y))

Which gives:

Mean=  4.99755359935
Median=  4.95213030447
Max/Min= 11.1226494654   -2.15388472011
StdDev= 1.97692712883

I have no idea which approach is the more efficient/robust.

Chris

I wanted to do something similar, except setting the number to NaN rather than removing it from the data, since if you remove it you change the length, which can mess up plotting (e.g. if you're only removing outliers from one column in a table, but you need it to stay the same length as the other columns so you can plot them against each other).

To do so I used numpy's masking functions:

def reject_outliers(data, m=2):
    stdev = np.std(data)
    mean = np.mean(data)
    maskMin = mean - stdev * m
    maskMax = mean + stdev * m
    mask = np.ma.masked_outside(data, maskMin, maskMax)
    print('Masking values outside of {} and {}'.format(maskMin, maskMax))
    return mask
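
Since the point was to keep the original length for plotting, a small follow-up sketch: the masked array returned above can be turned into a plain float array with NaN in the masked positions.

import numpy as np

data = np.array([2., 4., 5., 1., 6., 5., 40.])
masked = reject_outliers(data)      # masked array from the function above
with_nans = masked.filled(np.nan)   # outliers become NaN, length is preserved
print(with_nans)                    # [ 2.  4.  5.  1.  6.  5. nan]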
Alex S
  • You could also np.clip them to minimum and maximum allowed values to keep the dimensions. – Andy R Aug 31 '18 at 09:19

I would like to provide two methods in this answer: a solution based on the "z score" and a solution based on the "IQR".

The code provided in this answer works on both single-dimensional and multi-dimensional numpy arrays.

First, let's import some modules.

import collections.abc
import numpy as np
import scipy.stats as stat
from scipy.stats import iqr

z score based method

This method will test if a value falls more than three standard deviations from the mean (with the default bar of 3). Based on this rule, if the value is an outlier, the method returns True; if not, it returns False.

def sd_outlier(x, axis = None, bar = 3, side = 'both'):
    assert side in ['gt', 'lt', 'both'], 'Side should be `gt`, `lt` or `both`.'

    d_z = stat.zscore(x, axis = axis)

    if side == 'gt':
        return d_z > bar
    elif side == 'lt':
        return d_z < -bar
    elif side == 'both':
        return np.abs(d_z) > bar

IQR based method

This method will test if the value is less than q1 - 1.5 * iqr or greater than q3 + 1.5 * iqr, which is similar to SPSS's box plot method.

def q1(x, axis = None):
    return np.percentile(x, 25, axis = axis)

def q3(x, axis = None):
    return np.percentile(x, 75, axis = axis)

def iqr_outlier(x, axis = None, bar = 1.5, side = 'both'):
    assert side in ['gt', 'lt', 'both'], 'Side should be `gt`, `lt` or `both`.'

    d_iqr = iqr(x, axis = axis)
    d_q1 = q1(x, axis = axis)
    d_q3 = q3(x, axis = axis)
    iqr_distance = np.multiply(d_iqr, bar)

    stat_shape = list(x.shape)

    if axis is None:
        stat_shape = [1] * x.ndim
    elif isinstance(axis, collections.abc.Iterable):
        for single_axis in axis:
            stat_shape[single_axis] = 1
    else:
        stat_shape[axis] = 1

    if side in ['gt', 'both']:
        upper_range = d_q3 + iqr_distance
        upper_outlier = np.greater(x - upper_range.reshape(stat_shape), 0)
    if side in ['lt', 'both']:
        lower_range = d_q1 - iqr_distance
        lower_outlier = np.less(x - lower_range.reshape(stat_shape), 0)

    if side == 'gt':
        return upper_outlier
    if side == 'lt':
        return lower_outlier
    if side == 'both':
        return np.logical_or(upper_outlier, lower_outlier)

Finally, if you want to filter out the outliers, use a numpy selector.
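
For example, a small sketch of such a selector with the question's data (the IQR rule flags the 40; on this tiny sample nothing exceeds the 3-sigma z-score bar):

x = np.array([2, 4, 5, 1, 6, 5, 40])

iqr_mask = iqr_outlier(x, axis=0)  # boolean array, True where the IQR rule flags an outlier
print(x[~iqr_mask])                # [2 4 5 1 6 5]

sd_mask = sd_outlier(x, axis=0)    # the 3-standard-deviation z-score rule
print(x[~sd_mask])                 # [ 2  4  5  1  6  5 40]  (nothing dropped here)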

Have a nice day.

Losses Don

Consider that all the above methods fail when your standard deviation gets very large due to huge outliers.

(Similar to how the average calculation fails and one should rather calculate the median. Though, the average is "more prone to such an error than the std dev".)

You could try to apply your algorithm iteratively, or you can filter using the interquartile range (here "factor" relates to an n*sigma range, yet only when your data follows a Gaussian distribution):

import numpy as np

def sortoutOutliers(dataIn, factor):
    quant3, quant1 = np.percentile(dataIn, [75, 25])
    iqr = quant3 - quant1
    iqrSigma = iqr / 1.34896     # the IQR of a Gaussian is about 1.349 sigma
    medData = np.median(dataIn)
    dataOut = [x for x in dataIn if (medData - factor * iqrSigma < x < medData + factor * iqrSigma)]
    return dataOut
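
For instance, with the question's sample data (a quick check; with factor=3 the 40 falls well outside the allowed range):

print(sortoutOutliers([2, 4, 5, 1, 6, 5, 40], 3))  # [2, 4, 5, 1, 6, 5]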
K. Foe
  • Sorry, I overlooked that there is already an IQR suggestion above. Should I leave this answer anyway due to shorter code or delete it? – K. Foe Sep 09 '19 at 16:14

Here I find the outliers in x and substitute them with the median of a window of points (win) around them (taking the median deviation from Benjamin Bannier's answer)

import numpy as np
import matplotlib.pyplot as plt

def outlier_smoother(x, m=3, win=3, plots=False):
    ''' finds outliers in x, points > m*mdev(x) [mdev: median deviation]
    and replaces them with the median of win points around them '''
    x_corr = np.copy(x)
    d = np.abs(x - np.median(x))
    mdev = np.median(d)
    idxs_outliers = np.nonzero(d > m*mdev)[0]
    for i in idxs_outliers:
        if i-win < 0:
            x_corr[i] = np.median(np.append(x[0:i], x[i+1:i+win+1]))
        elif i+win+1 > len(x):
            x_corr[i] = np.median(np.append(x[i-win:i], x[i+1:len(x)]))
        else:
            x_corr[i] = np.median(np.append(x[i-win:i], x[i+1:i+win+1]))
    if plots:
        plt.figure('outlier_smoother', clear=True)
        plt.plot(x, label='orig.', lw=5)
        plt.plot(idxs_outliers, x[idxs_outliers], 'ro', label='outliers')                                                                                                                    
        plt.plot(x_corr, '-o', label='corrected')
        plt.legend()
    
    return x_corr
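
A minimal usage sketch (the question's data with an extra point appended; the 40 is replaced by the median of its neighbouring points):

x = np.array([2., 4., 5., 1., 6., 5., 40., 3.])
print(outlier_smoother(x))  # [2. 4. 5. 1. 6. 5. 4. 3.]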

(plot: the original signal, the detected outliers, and the corrected signal)

scrx2

So many answers, but I'm adding a new one that can be useful for the author or even for other users.

You could use the Hampel filter. But you need to work with a Series.

The Hampel filter returns the outlier indices; you can then delete them from the Series and convert it back to a list.

To use the Hampel filter, you can easily install the package with pip:

pip install hampel

Usage:

# Imports
from hampel import hampel
import pandas as pd

list_d = [2, 4, 5, 1, 6, 5, 40]

# List to Series
time_series = pd.Series(list_d)

# Outlier detection with Hampel filter
# Returns the Outlier indices
outlier_indices = hampel(ts = time_series, window_size = 3)

# Drop Outliers indices from Series
filtered_d = time_series.drop(outlier_indices)

filtered_d.values.tolist()

print(f'filtered_d: {filtered_d.values.tolist()}')

And the output will be:

filtered_d: [2, 4, 5, 1, 6, 5]

Here, ts is a pandas Series object and window_size is the window size; the total window size will be computed as 2 * window_size + 1.

For this Series I set window_size to 3.

The cool thing about working with Series is being able to generate graphics:

# Imports
import matplotlib.pyplot as plt

plt.style.use('seaborn-darkgrid')

# Plot Original Series
time_series.plot(style = 'k-')
plt.title('Original Series')
plt.show()
    
# Plot Cleaned Series
filtered_d.plot(style = 'k-')
plt.title('Cleaned Series (Without detected Outliers)')
plt.show()

And the output will be:

(plots: the original series, and the cleaned series without the detected outliers)

To learn more about the Hampel filter, I recommend the following readings:

whoisraibolt

If you want to get the index positions of the outliers, idx_list will return them.

def reject_outliers(data, m=2.):
    d = np.abs(data - np.median(data))
    mdev = np.median(d)
    s = d / mdev if mdev else 0.
    data_range = np.arange(len(data))
    idx_list = data_range[s >= m]
    return data[s < m], idx_list

data_points = np.array([8, 10, 35, 17, 73, 77])  
print(reject_outliers(data_points))

after rejection: [ 8 10 35 17], index positions of outliers: [4 5]
Caner Erden

For a set of images (each image having 3 dimensions), where I wanted to reject outliers for each pixel, I used:

mean = np.mean(imgs, axis=0)
std = np.std(imgs, axis=0)
mask = np.greater(0.5 * std + 1, np.abs(imgs - mean))
masked = np.multiply(imgs, mask)

Then it is possible to compute the mean:

masked_mean = np.divide(np.sum(masked, axis=0), np.sum(mask, axis=0))

(I use it for Background Subtraction)
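
A small self-contained sketch of the same idea (the image stack and the corrupted pixel are made up for illustration):

import numpy as np

imgs = np.random.normal(100., 5., size=(20, 8, 8))  # hypothetical stack of 20 images, 8x8 pixels
imgs[3, 2, 2] = 10000.                              # plant a gross outlier at one pixel of one frame

mean = np.mean(imgs, axis=0)
std = np.std(imgs, axis=0)
mask = np.greater(0.5 * std + 1, np.abs(imgs - mean))          # per-pixel acceptance mask
masked = np.multiply(imgs, mask)                               # rejected samples contribute zero
masked_mean = np.divide(np.sum(masked, axis=0), np.sum(mask, axis=0))

print(mean[2, 2], masked_mean[2, 2])  # the plain mean is dragged up, the masked mean stays near 100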

ron653

This trims outliers in a numpy array along an axis and replaces them with the min or max value along this axis, whichever is closer. The threshold is a z-score:

def np_z_trim(x, threshold=10, axis=0):
    """ Replace outliers in numpy ndarray along axis with min or max values
    within the threshold along this axis, whichever is closer."""
    mean = np.mean(x, axis=axis, keepdims=True)
    std = np.std(x, axis=axis, keepdims=True)
    masked = np.where(np.abs(x - mean) < threshold * std, x, np.nan)
    min = np.nanmin(masked, axis=axis, keepdims=True)
    max = np.nanmax(masked, axis=axis, keepdims=True)
    repl = np.where(np.abs(x - max) < np.abs(x - min), max, min)
    return np.where(np.isnan(masked), repl, masked)
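
A minimal usage sketch (the default threshold of 10 is very permissive, so a smaller one is passed here to make the replacement visible):

x = np.array([1., 2., 3., 4., 5., 6., 7., 8., 9., 100.])
print(np_z_trim(x, threshold=2))
# [1. 2. 3. 4. 5. 6. 7. 8. 9. 9.]   (the 100 is replaced by the in-threshold maximum)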
Yaroslav

My solution drops the top and bottom percentiles, keeping values that are equal to the boundary:

def remove_percentile_outliers(data, percent_to_drop=0.001):
    low, high = data.quantile([percent_to_drop / 2, 1 - percent_to_drop / 2])
    return data[(data >= low) & (data <= high)]
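
Note that this assumes data has a .quantile method, e.g. a pandas Series. A small sketch (on a tiny sample, percentile trimming takes a bite from both tails):

import pandas as pd

data = pd.Series([2, 4, 5, 1, 6, 5, 40])
print(remove_percentile_outliers(data, percent_to_drop=0.3).tolist())  # [2, 4, 5, 6, 5]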
Johnny V

My solution sets each outlier equal to the previous value.

test_data = [2,4,5,1,6,5,40, 3]
def reject_outliers(data, m=2):
    mean = np.mean(data)
    std = np.std(data)
    for i in range(len(data)) :
        if np.abs(data[i] -mean) > m*std :
            data[i] = data[i-1]
    return data
reject_outliers(test_data)

Output:

[2, 4, 5, 1, 6, 5, 5, 3]