First of all, the "way below or too high" concept you refer to is known as an outlier. Quoting Wikipedia (not the best source):
There is no rigid mathematical definition of what constitutes an outlier; determining whether or not an observation is an outlier is ultimately a subjective exercise.
But on the other hand:
In general, if the nature of the population distribution is known a priori, it is possible to test if the number of outliers deviate significantly from what can be expected.
So in my opinion this boils down to the question of whether you can make assumptions about the nature of your data, so that such decisions can be automated.
STRAIGHTFORWARD APPROACH
If you are lucky enough to have a relatively large sample size, and your different samples aren't correlated, you can invoke the central limit theorem and model your values as approximately normally distributed (see this for a Python-related explanation).
In this context, you can quickly get the mean value and standard deviation of the given dataset. Then, by applying the corresponding density function (with these two parameters) to each given value, you can calculate its probability of belonging to the "cluster" (see this stackoverflow post for a possible Python solution).
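For instance, here is a minimal sketch of that step, assuming the data is a plain 1-D NumPy array (the variable name `values` and the toy numbers are made up for illustration):

```python
import numpy as np
from scipy.stats import norm

values = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 10.1, 25.0])  # toy data

mu = values.mean()
sigma = values.std(ddof=1)   # sample standard deviation

# Distance of each point from the mean, in sigma units (z-score), and the
# two-sided probability of seeing a point at least that far from the mean.
z = (values - mu) / sigma
p = 2 * norm.sf(np.abs(z))
print(p)   # the last point gets a much smaller probability than the rest
```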
Then you do have to set a lower bound, since this distribution only returns a 0% probability when a point is infinitely far away from the mean value. But the good thing is that (if the assumptions hold) this bound adapts nicely to each different dataset, because of its exponential, normalized nature. This bound is typically expressed in sigma units, and is widely used in science and statistics. As a matter of fact, the 2013 Nobel Prize in Physics, awarded for the discovery of the Higgs boson, was granted only after a 5-sigma level was reached. Quoting the link:
High-energy physics requires even lower p-values to announce evidence or discoveries. The threshold for "evidence of a particle," corresponds to p=0.003, and the standard for "discovery" is p=0.0000003.
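As a rough sketch of how such a sigma bound could be applied in practice (the 3-sigma cut-off and the toy data below are only illustrative, not a recommendation):

```python
import numpy as np

rng = np.random.default_rng(0)
# 200 well-behaved points plus one far-away value
values = np.append(rng.normal(loc=10.0, scale=0.2, size=200), 25.0)

mu, sigma = values.mean(), values.std(ddof=1)
k = 3.0                                   # example cut-off, in sigma units

# Flag everything more than k standard deviations away from the mean.
is_outlier = np.abs(values - mu) > k * sigma
print(values[is_outlier])                 # only the 25.0 should survive the cut
```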
ALTERNATIVES
If you cannot make such simple assumptions about how your data should look, you can always let a program infer them. This approach is a core feature of most machine learning algorithms, which can adapt nicely to strongly correlated and even skewed data if fine-tuned properly. If this is what you need, Python has many good libraries for that purpose, which can even fit in a small script (the one I know best is tensorflow from Google).
In this case I would consider two different approaches, depending again on what your data looks like:
Supervised learning: In case you have a training set at your disposal that states which samples belong and which ones don't (known as labeled data), there are algorithms like the support vector machine that, although lightweight, can adapt amazingly well to highly non-linear boundaries (see the sketch after this list).
Unsupervised learning: This is probably what I would try first, when you simply have the unlabeled dataset. The "straightforward approach" I mentioned before is the simplest case of an anomaly detector, and it can be heavily tweaked and customized to also account for correlations, even in an infinite number of dimensions thanks to the kernel trick (again, see the sketch after this list). To understand the motivation and approach of an ML-based anomaly detector, I would suggest taking a look at Andrew Ng's videos on the matter.
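As a rough illustration of both bullet points, here is a sketch using scikit-learn instead of the tensorflow library mentioned above, simply because it ships ready-made SVM classes (SVC for the labeled case, OneClassSVM for the unlabeled one); all data and parameters below are made up:

```python
import numpy as np
from sklearn.svm import SVC, OneClassSVM

rng = np.random.default_rng(0)

# --- Supervised: a labeled training set (1 = belongs, 0 = doesn't) ---
X_train = np.vstack([rng.normal(10.0, 0.3, size=(50, 2)),    # samples that belong
                     rng.normal(14.0, 0.3, size=(10, 2))])   # samples that don't
y_train = np.array([1] * 50 + [0] * 10)

clf = SVC(kernel="rbf")          # the RBF kernel allows non-linear boundaries
clf.fit(X_train, y_train)
print(clf.predict([[10.1, 9.9], [14.2, 13.8]]))   # should give [1 0]

# --- Unsupervised: only the raw, unlabeled samples ---
X = np.vstack([rng.normal(10.0, 0.3, size=(60, 2)),
               [[14.0, 14.0]]])                   # one far-away point

detector = OneClassSVM(kernel="rbf", nu=0.05)     # nu ~ expected fraction of outliers
detector.fit(X)
print(detector.predict([[10.0, 10.0], [14.0, 14.0]]))   # should give [ 1 -1], -1 meaning anomaly
```

The same kind of boundary could of course be learned with tensorflow as well; scikit-learn is just the shorter way to show the idea.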
I hope it helps!
Cheers