I need some Python code that removes outliers from a list of 3 to 15 values, each ranging from -1 to 1 (the number of values in the list varies with each iteration of my code).
For instance, given this list: [-0.7175272120817555, 0.9837397718584584, -0.36232204136099067, -0.49069507966635584, 0.24974696098557403, -0.0728351680016388, 0.054339948765399715, -0.657257432018868]
the obvious outlier to remove would be 0.98 (the second value).
I found this code in an answer to a similar question:
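    import numpy as np

    def reject_outliers(data, m=6):
        data = np.array(data)
        d = np.abs(data - np.median(data))  # absolute deviation of each value from the median
        mdev = np.median(d)                 # median of those absolute deviations
        s = d / (mdev if mdev else 1.)
        if data[s < m].tolist() == []:
            # nothing passed the cutoff: return the input unchanged, as a list like the other branch
            return data.tolist()
        else:
            return data[s < m].tolist()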
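To show what I mean about "m", here is a small self-contained sketch that runs the function on the example list. The function is repeated (slightly condensed) so the snippet runs on its own, and m=3 is just a value picked for comparison, not something taken from the original answer.

```python
import numpy as np

def reject_outliers(data, m=6):
    data = np.array(data)
    d = np.abs(data - np.median(data))
    mdev = np.median(d)
    s = d / (mdev if mdev else 1.)
    kept = data[s < m]
    # fall back to the full input if every value was rejected
    return kept.tolist() if kept.size else data.tolist()

values = [-0.7175272120817555, 0.9837397718584584, -0.36232204136099067,
          -0.49069507966635584, 0.24974696098557403, -0.0728351680016388,
          0.054339948765399715, -0.657257432018868]

kept_default = reject_outliers(values)       # default m=6
kept_strict = reject_outliers(values, m=3)   # lower cutoff, chosen only for illustration

print(len(kept_default))  # 8: nothing is removed with m=6
print(len(kept_strict))   # 7: only 0.98 is removed with m=3
```

So with the default m=6 the value 0.98 is actually kept for this list, and it is only dropped once "m" is lowered to around 3.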
I'm not sure how changing "m" affects the output of this function. I know that lowering "m" causes more values to be counted as outliers, but I want to understand this function the way I understand z-scores (for instance, I know a z-score of 3 means a value is 3 standard deviations from the mean). So my question is: what does "m" represent, statistically or in terms of standard deviations?
I also have some more specific questions about the function: Why is the median calculated twice (first for the "data" variable and then again for "d")? What is the purpose of "mdev if mdev else 1."? Lastly, what does "s" represent, and why is the cutoff s < m? Thank you! Sorry, my stats background isn't great (I'm currently enrolled in an intro to stats course).
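To make these questions concrete, here is a minimal sketch that exposes the intermediate quantities for the example list. The helper name outlier_scores is hypothetical (it is not part of the original answer); it just returns the same median, "mdev", and "s" that reject_outliers computes internally.

```python
import numpy as np

def outlier_scores(data):
    """Hypothetical helper: return the intermediates used by reject_outliers."""
    data = np.array(data)
    med = np.median(data)            # first median: center of the data
    d = np.abs(data - med)           # distance of each value from that center
    mdev = np.median(d)              # second median: typical distance from the center
    s = d / (mdev if mdev else 1.)   # distances expressed in units of mdev
    return med, mdev, s

values = [-0.7175272120817555, 0.9837397718584584, -0.36232204136099067,
          -0.49069507966635584, 0.24974696098557403, -0.0728351680016388,
          0.054339948765399715, -0.657257432018868]

med, mdev, s = outlier_scores(values)
print(round(float(med), 4))
print(round(float(mdev), 4))
print(np.round(s, 2).tolist())
```

For this list the median is about -0.2176, "mdev" is about 0.3564, and the largest entry of "s" is about 3.37 (for 0.98), which is the number that gets compared against "m".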