
I need some Python code that removes outliers from a list of 3 to 15 values, each ranging from -1 to 1 (the number of values in the list varies with each iteration of my code).

For instance, if I had this list: [-0.7175272120817555, 0.9837397718584584, -0.36232204136099067, -0.49069507966635584, 0.24974696098557403, -0.0728351680016388, 0.054339948765399715, -0.657257432018868]

The obvious outlier to remove would be 0.98 (2nd value).

I found this code in an answer to a similar question:

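    import numpy as np

    def reject_outliers(data, m=6):
        data = np.array(data)
        # Absolute distance of each value from the median
        d = np.abs(data - np.median(data))
        # Median of those distances
        mdev = np.median(d)
        # Distances in units of mdev; fall back to 1 if mdev is zero
        s = d / (mdev if mdev else 1.)
        kept = data[s < m].tolist()
        # If every value would be rejected, return the data unchanged
        return kept if kept else data.tolist()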

I'm not sure what changing "m" does to the output of this function. I know that as you lower "m", more values are counted as outliers, but I want to understand this function the way I understand z-scores (for instance, I know a z-score of 3 means 3 standard deviations from the mean). So, my question is: what does "m" represent, statistically or in terms of standard deviations?
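One way to relate "m" to z-scores, offered as a rough sketch rather than a definitive answer: the function measures each value's distance from the median in units of the median absolute deviation (MAD), and for normally distributed data one standard deviation is roughly 1.4826 MADs. Under that normality assumption, a cutoff of m MADs corresponds to about m / 1.4826 standard deviations. The names `MAD_TO_SIGMA`, `m_to_sigma_units`, and `robust_z_scores` are made up for illustration and are not part of the original function.

```python
import numpy as np

# Assumption: the data are roughly normal, in which case
# one standard deviation is about 1.4826 * MAD.
MAD_TO_SIGMA = 1.4826


def m_to_sigma_units(m):
    """Convert a cutoff of m MADs into approximate standard deviations."""
    return m / MAD_TO_SIGMA


def robust_z_scores(data):
    """Distance from the median in MAD-based standard deviation units."""
    data = np.asarray(data, dtype=float)
    d = np.abs(data - np.median(data))
    mad = np.median(d)
    # Same guard as the original function: avoid dividing by zero.
    scale = MAD_TO_SIGMA * mad if mad else 1.0
    return d / scale


values = [-0.7175272120817555, 0.9837397718584584, -0.36232204136099067,
          -0.49069507966635584, 0.24974696098557403, -0.0728351680016388,
          0.054339948765399715, -0.657257432018868]

print("m=6 in approximate standard deviations:", m_to_sigma_units(6))
print("robust z-scores:", np.round(robust_z_scores(values), 2))
```

Read this way, the default m=6 is roughly a 4 standard deviation cutoff, which is fairly lenient. The conversion only holds to the extent that the data look normal.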

These are some other specific questions I have about the function: Why is the median calculated twice (first for the "data" variable and then again for "d")? What is the purpose of "mdev if mdev else 1."? Lastly, what does "s" represent, and why is the cutoff s < m? Thank you! Sorry, my stats background isn't great (I'm currently enrolled in an intro to stats course).
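To make the individual steps concrete, here is a small sketch that recomputes the function's intermediate values on the example list above. The helper name `outlier_scores` is hypothetical; the arithmetic mirrors the lines of `reject_outliers`. The first median is the center of the data, the second is the median of the distances from that center (the median absolute deviation, MAD), and "s" is each distance expressed in MADs.

```python
import numpy as np


def outlier_scores(data):
    """Return the intermediate values used by reject_outliers (hypothetical helper)."""
    data = np.asarray(data, dtype=float)
    center = np.median(data)         # first median: center of the data
    d = np.abs(data - center)        # distance of each value from the center
    mdev = np.median(d)              # second median: typical distance (MAD)
    s = d / (mdev if mdev else 1.0)  # distances in MADs; guard against mdev == 0
    return center, d, mdev, s


values = [-0.7175272120817555, 0.9837397718584584, -0.36232204136099067,
          -0.49069507966635584, 0.24974696098557403, -0.0728351680016388,
          0.054339948765399715, -0.657257432018868]

center, d, mdev, s = outlier_scores(values)
print("median:", center)
print("MAD:", mdev)
print("scores:", np.round(s, 2))

# Compare how many values survive under two cutoffs.
for m in (6, 3):
    kept = np.asarray(values)[s < m]
    print(f"m={m}: kept {kept.size} of {len(values)} values")
```

On this list the score for 0.98 comes out at about 3.4, so the default m=6 keeps it, while a cutoff such as m=3 removes it. The "mdev if mdev else 1." guard only matters when the MAD is zero (for example, when most of the values are identical), where dividing by it would produce infinities or NaNs.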

srv_77
  • You should link to [the question](https://stackoverflow.com/questions/11686720/is-there-a-numpy-builtin-to-reject-outliers-from-a-list) that contains that answer (among others). Looking at the question, and some of the other answers, there's so much interesting discussion of `m`, standard deviation, and various approaches to the problem that it's not clear what's left to discuss here. The best thing is probably to play with the function in the Python interpreter, and try changing `m` or inspecting different lines in the code. Then come back with a more specific question. – Matt Hall Sep 07 '21 at 17:51
  • What about the other questions I listed (last paragraph)? Thanks – srv_77 Sep 07 '21 at 20:12
  • Stack Overflow is really meant for programming questions — specific questions about getting code to work properly. But this code does work, you just need to study it a bit — and read the posts in that other question — to understand what it's doing. Outlier detection is a well studied topic, you should have no trouble finding things to try. And after that, if you have more statistics questions, consider asking on [Cross Validated](https://stats.stackexchange.com/). Or, if you have programming questions, this is definitely the place. Good luck! – Matt Hall Sep 08 '21 at 12:29
