I have this dataframe, DF1, with these columns:
obs_1 obs_2
31 173
16 20
38 49
12 16
45 49
14 174
83 88
43 46
43 46
27 45
32 40
625 669
4 4
61 99
20 26
103 -356
8 110
146 246
38 50
11 92
10 97
9 90
217 234
9 177
28 28
22 22
12 123
35 147
59 63
31 143
18 130
45 55
46 50
21 21
17 152
63 70
52 73
24 24
15 -1172
43 54
88 96
22 34
42 56
14 56
19 20
40 42
23 120
68 73
80 -1263
14 124
35 41
40 176
13 52
21 26
22 102
43 -1325
18 18
36 162
68 69
17 34
20 30
26 27
45 55
78 82
I am trying to find the outliers and flag each one in a new column, using this function:
import numpy as np

def is_outlier(points, thresh=3.5):
    """
    Returns a boolean array with True if points are outliers and False
    otherwise.

    Parameters:
    -----------
        points : An numobservations by numdimensions array of observations
        thresh : The modified z-score to use as a threshold. Observations with
            a modified z-score (based on the median absolute deviation) greater
            than this value will be classified as outliers.

    Returns:
    --------
        mask : A numobservations-length boolean array.

    References:
    ----------
        Boris Iglewicz and David Hoaglin (1993), "Volume 16: How to Detect and
        Handle Outliers", The ASQC Basic References in Quality Control:
        Statistical Techniques, Edward F. Mykytka, Ph.D., Editor.
    """
    # Treat a 1-D input as a single column of observations
    if len(points.shape) == 1:
        points = points[:, None]
    median = np.median(points, axis=0)
    # Absolute deviation from the median, written as sqrt of the square
    diff = (points - median) ** 2
    diff = np.sqrt(diff)
    med_abs_deviation = np.median(diff)
    modified_z_score = 0.6745 * diff / med_abs_deviation
    return modified_z_score > thresh
Discussed here: Link to discussion
I have tried this code:
DF1['obs_1_outlier'] = is_outlier(df1.obs_1.to_numpy())
I don't receive any errors, but every result is False, and I suspect that something in the function isn't being calculated correctly.
My feeling is that the problem lies in how I am passing the column to the function, but I can't put my finger on it.
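For comparison, here is a minimal sketch of the same modified z-score calculation written for a single column only. It assumes NumPy and pandas are imported as `np` and `pd`; `mad_outlier_mask` and `sample` are hypothetical names used for illustration, not part of the original function, and the values are a handful of obs_1 entries from the table above.

```python
import numpy as np
import pandas as pd

def mad_outlier_mask(values, thresh=3.5):
    """Hypothetical 1-D restatement: modified z-score based on the MAD."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    abs_dev = np.abs(values - median)      # equivalent to sqrt((x - median)**2)
    mad = np.median(abs_dev)               # median absolute deviation
    modified_z = 0.6745 * abs_dev / mad    # assumes mad is not zero
    return modified_z > thresh             # shape (n,), one flag per row

# A few obs_1 values copied from the table above
sample = pd.DataFrame({"obs_1": [31, 16, 38, 12, 45, 14, 83, 43, 625]})
sample["obs_1_outlier"] = mad_outlier_mask(sample["obs_1"].to_numpy())
print(sample)
```

On this small sample only the 625 row is flagged, which is the kind of result I would expect from the full column as well.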
Edit 1/2023 - removed np.sum from:
diff = np.sum((points - median)**2, axis=-1)
Thanks to Guilherme.
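A quick shape check of the function as it stands after that edit may be useful. This is only a sketch on a few obs_1 values from the table, with the docstring omitted and `obs` and `mask` as illustrative names. With a 1-D input, the returned mask has shape (n, 1) rather than (n,), so it may need flattening (for example with `.ravel()`) before it is assigned as a column.

```python
import numpy as np

def is_outlier(points, thresh=3.5):
    # Same logic as the function above, docstring omitted for brevity
    if len(points.shape) == 1:
        points = points[:, None]
    median = np.median(points, axis=0)
    diff = np.sqrt((points - median) ** 2)
    med_abs_deviation = np.median(diff)
    modified_z_score = 0.6745 * diff / med_abs_deviation
    return modified_z_score > thresh

obs = np.array([31, 16, 38, 12, 45, 14, 83, 43, 625])
mask = is_outlier(obs)
print(mask.shape)    # (9, 1): a column-shaped 2-D array, not a flat mask
print(mask.ravel())  # flattened to one boolean per observation
```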