I have this dataframe, DF1, with these columns:

obs_1   obs_2
31  173
16  20
38  49
12  16
45  49
14  174
83  88
43  46
43  46
27  45
32  40
625 669
4   4
61  99
20  26
103 -356
8   110
146 246
38  50
11  92
10  97
9   90
217 234
9   177
28  28
22  22
12  123
35  147
59  63
31  143
18  130
45  55
46  50
21  21
17  152
63  70
52  73
24  24
15  -1172
43  54
88  96
22  34
42  56
14  56
19  20
40  42
23  120
68  73
80  -1263
14  124
35  41
40  176
13  52
21  26
22  102
43  -1325
18  18
36  162
68  69
17  34
20  30
26  27
45  55
78  82

I am trying to find the outliers and flag each one in a new column, using this function:

import numpy as np


def is_outlier(points, thresh=3.5):
    """
    Returns a boolean array with True if points are outliers and False 
    otherwise.

    Parameters:
    -----------
        points : A numobservations by numdimensions array of observations
        thresh : The modified z-score to use as a threshold. Observations with
            a modified z-score (based on the median absolute deviation) greater
            than this value will be classified as outliers.

    Returns:
    --------
        mask : A numobservations-length boolean array.

    References:
    ----------
        Boris Iglewicz and David Hoaglin (1993), "Volume 16: How to Detect and
        Handle Outliers", The ASQC Basic References in Quality Control:
        Statistical Techniques, Edward F. Mykytka, Ph.D., Editor. 
    """
    # Promote a 1-D input to a single column so the axis-based operations work.
    if len(points.shape) == 1:
        points = points[:, None]
    median = np.median(points, axis=0)
    diff = (points - median) ** 2
    diff = np.sqrt(diff)
    med_abs_deviation = np.median(diff)

    modified_z_score = 0.6745 * diff / med_abs_deviation

    return modified_z_score > thresh

Discussed here: Link to discussion

I have tried this code:

DF1['obs_1_outlier'] = is_outlier(DF1.obs_1.to_numpy())

I don't receive any errors, but all results are False, and I suspect that something in the function isn't calculating correctly.

I have a feeling the problem is with the way I am passing the column to the function, but I can't put my finger on it.

Edit 1/2023 - removed np.sum from:

diff = np.sum((points - median)**2, axis=-1)

Thanks to Guilherme.
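
For a single column, the two versions of the diff line appear to produce the same values and differ only in the shape of the result. A minimal sketch to check this, using made-up numbers rather than the data above (the names `with_sum` and `without_sum` are just for illustration):

```python
import numpy as np

# Made-up 1-D sample, not the data from the question.
points = np.array([31.0, 16.0, 38.0, 12.0, 45.0, 625.0])[:, None]  # shape (6, 1)
median = np.median(points, axis=0)

# Version with np.sum: collapses the last axis, result has shape (6,).
with_sum = np.sqrt(np.sum((points - median) ** 2, axis=-1))

# Version without np.sum: keeps the column axis, result has shape (6, 1).
without_sum = np.sqrt((points - median) ** 2)

print(with_sum.shape, without_sum.shape)
print(np.allclose(with_sum, without_sum.ravel()))
```

With more than one column the two versions would no longer agree, because np.sum combines the squared differences across dimensions into a single distance per row.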

eclipsedlamp
  • Change this `0.6745 * diff / med_abs_deviation` to include brackets. Either `(0.6745 * diff) / med_abs_deviation` or `0.6745 * (diff / med_abs_deviation)` based on your requirement. – vb_rises Dec 26 '19 at 18:20
  • Hmmm, I tried both but I am still getting all false results. I even tried to lower the threshold to 0.5. – eclipsedlamp Dec 26 '19 at 18:39
  • The function is working correctly. There are many interpretations of "outlier". Above is just one of the algorithms to detect them. Apparently following the logic, none of your data points are marked as outliers. – Erfan Dec 26 '19 at 18:48
  • @eclipsedlamp I tried your code and it is working correctly. I am getting `True` for some entries. – vb_rises Dec 26 '19 at 18:50
  • I found the problem I was having. My full data set had NaNs in the columns and it wasn't calculating the median properly. I swapped np.median for np.nanmedian and I am now getting correct results. – eclipsedlamp Dec 26 '19 at 20:38
  • I want to add an important comment. Your code is wrong. You should take 'np.sum' out of your diff. – Guilherme Giuliano Nicolau Oct 16 '22 at 18:43
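
The NaN fix mentioned in the comments above (swapping np.median for np.nanmedian) can be sketched roughly as follows. This is a minimal 1-D sketch, not the original function: the name `is_outlier_nan_aware` and the sample values are made up for illustration. It also suggests why every result came back False: np.median returns NaN when any input is NaN, and any comparison against NaN is False.

```python
import numpy as np
import pandas as pd


def is_outlier_nan_aware(points, thresh=3.5):
    """Hypothetical NaN-tolerant variant of is_outlier for a 1-D array."""
    points = np.asarray(points, dtype=float)
    median = np.nanmedian(points)            # ignores NaN, unlike np.median
    diff = np.abs(points - median)           # NaN inputs stay NaN here
    med_abs_deviation = np.nanmedian(diff)
    modified_z_score = 0.6745 * diff / med_abs_deviation
    # Comparisons with NaN evaluate to False, so missing values are never flagged.
    return modified_z_score > thresh


# Made-up sample with one missing value, not the data from the question.
df = pd.DataFrame({"obs_1": [31, 16, 38, 12, np.nan, 45, 625]})

# A single NaN is enough to make the plain median NaN.
print(np.median(df["obs_1"].to_numpy()))

df["obs_1_outlier"] = is_outlier_nan_aware(df["obs_1"].to_numpy())
print(df)
```

Note that this sketch does not guard against a median absolute deviation of zero, which would cause a division by zero when more than half of the values are identical.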

0 Answers