calculating the outliers in R

Question

I have a data frame like this:

x

Team 01/01/2012  01/02/2012  01/03/2012  01/01/2012 01/04/2012 SD Mean
A     100         50           40        NA         30       60  80

I like to perform calculation on each cell to the mean and sd to calculate the outliers. For example,

abs(x-Mean) > 3*SD

x$count<-c(1) (increment this value if the above condition is met).

I am doing this to check the anomaly in my data set. If I know the column names, it would be easier to do the calculations, but number of columns will vary. Some cells may have NA in them.

I like to subtrack mean from each cell, and I tried this

x$diff<-sweep(x, 1, x$Mean, FUN='-')

does not seem to be working, any ideas?

If you provide us with a little sample data with `dput(head(x))`, then we can just cut and paste it into our browsers, and test our solutions. — nograpes, Oct 12 '12 at 21:40

score 48 · Answer 1 · edited Jul 09 '15 at 14:00

48

Get your IQR (Interquartile range) and lower/upper quartile using:

lowerq = quantile(data)[2]
upperq = quantile(data)[4]
iqr = upperq - lowerq #Or use IQR(data)

Compute the bounds for a mild outlier:

mild.threshold.upper = (iqr * 1.5) + upperq
mild.threshold.lower = lowerq - (iqr * 1.5)

Any data point outside (> mild.threshold.upper or < mild.threshold.lower) these values is a mild outlier

To detect extreme outliers do the same, but multiply by 3 instead:

extreme.threshold.upper = (iqr * 3) + upperq
extreme.threshold.lower = lowerq - (iqr * 3)

Any data point outside (> extreme.threshold.upper or < extreme.threshold.lower) these values is an extreme outlier

Hope this helps

edit: was accessing 50%, not 75%

edited Jul 09 '15 at 14:00

Karsten Sender

89
1
1
8

answered Oct 12 '12 at 21:45

Omar Wagih

8,504
7
59
75

3

Should be `upperq = quantile(data)[4]` – Ben Dec 19 '14 at 19:54
This is going to be a very bad algorithm .For e.g take a fairly large vector where say 80% of the data points are in a short range (e.g 10-100) and rest 20% are very sparse then this algorithm will identify a huge number of outliers ,which may not give a true sense of the outliers in the population – Bg1850 Feb 04 '16 at 22:13
3

The answer given here is a well-known approach due to Tukey. See: https://en.wikipedia.org/wiki/Outlier#Tukey.27s_test – stackoverflowuser2010 May 03 '16 at 05:09

score 9 · Answer 2 · answered Jun 14 '16 at 06:55

I have used @by0's answer above to create a function that automatically removes outliers. Here is the function and some example code:

# generate 10 random numbers and 2 'outlier' numbers
testData <- c(-42,rnorm(10),42)

# show the numbers
testData

# define a function to remove outliers
FindOutliers <- function(data) {
  lowerq = quantile(data)[2]
  upperq = quantile(data)[4]
  iqr = upperq - lowerq #Or use IQR(data)
  # we identify extreme outliers
  extreme.threshold.upper = (iqr * 3) + upperq
  extreme.threshold.lower = lowerq - (iqr * 3)
  result <- which(data > extreme.threshold.upper | data < extreme.threshold.lower)
}

# use the function to identify outliers
temp <- FindOutliers(testData)

# remove the outliers
testData <- testData[-temp]

# show the data with the outliers removed
testData

Roland · Answer 3 · 2012-10-13T10:04:44.150

I have seen that you've asked some questions on doing things by row. You should avoid that. R follows the concept that columns represent variables and rows represent observations. Many functions are optimized according to this concept. If you need a wide or transposed output to a file you can rearrange your data just before writing to the file.

I assume that your data actually looks as shown in the question, but that you have more than one row.

df <- read.table(text="Team 01/01/2012  01/02/2012  01/03/2012  01/01/2012 01/04/2012 SD 

Mean
A     100         50           40        NA         30       60  80
B     200         40           5         8          NA       NA  NA",check.names = FALSE,header=TRUE)

#needed because one date appears twice
df <- df[,]

#reshape the data
library(reshape2)
df <- melt(df,id="Team")
names(df)[2] <- "Date"

#remove the SD and Mean
df <- df[!df$Date %in% c("SD","Mean"),]

#function to detect outliers
outfun <- function(x) {
  abs(x-mean(x,na.rm=TRUE)) > 3*sd(x,na.rm=TRUE)
}

#test if function works
outfun(c(200,rnorm(10)))

#use function over all data
df3$outlier.all <- outfun(df3$value)

#apply function for each team 
library(plyr)
df3 <- ddply(df3,.(Team),transform,outlier.team=outfun(value))

Result:

           Date Team value outlier.all outlier.team
1    01/01/2012    A   100       FALSE        FALSE
2    01/02/2012    A    50       FALSE        FALSE
3    01/03/2012    A    40       FALSE        FALSE
4  01/01/2012.1    A    NA          NA           NA
5    01/04/2012    A    30       FALSE        FALSE
6    01/01/2012    B   200       FALSE        FALSE
7    01/02/2012    B    40       FALSE        FALSE
8    01/03/2012    B     5       FALSE        FALSE
9  01/01/2012.1    B     8       FALSE        FALSE
10   01/04/2012    B    NA          NA           NA

Hi @Roland. Thank you for your response. This is a special case where I have thousands of observations (rows), that I would like to find out the outliers and then graph them only. I am transforming dates to rows and trying to check each cell against the mean and count number of times it reached that outlier point. Then, I will pick 10 or 20 of the items and graph them. Basically, I am trying to catch the anamolies in my data set. — user1471980, Oct 13 '12 at 23:42
@user1471980, well, my answer is a starting point to do this. It is actually not difficult to do in R (provided the data is in long format and I demonstrated how to achieve that). Depending on the number of observations (you write thousands but that could also mean hundred thousands) and data groups, other approaches might be preferable considering computation time. But you'd better ask a new question giving all information including your end goal if my answer is not sufficient. — Roland, Oct 14 '12 at 11:09
thank you for your input. As per your recommendation, I did create another question, hopefully I've made point - http://stackoverflow.com/questions/12888212/detecting-outliers-on-wide-data-frame — user1471980, Oct 15 '12 at 01:31
Reshaping data - long or wide - is easy enough. Tidy data is preferable, but some operations are more practical rowwise. Tidy data should be a principle, not a religion. — Nettle, Oct 07 '18 at 00:11

score 2 · Answer 4 · edited Sep 23 '19 at 07:32

2

The following formulas could be used to determine which values are outliers:

upper.outlier.calc <- function(x.var, df){
  with(df, quantile(x.var, 0.75) + (1.5 * (quantile(x.var, 0.75) - quantile(x.var, 0.25))))
}

lower.outlier.calc <- function(x.var, df){
  with(df, quantile(x.var, 0.25) - (1.5 * (quantile(x.var, 0.75) - quantile(x.var, 0.25))))
}

edited Sep 23 '19 at 07:32

Eric Leung

2,354
12
25

answered Jul 09 '17 at 19:05

Jorge Riera

21
1

Jerry T · Answer 5 · 2019-01-24T05:03:15.293

check out my most sophisticated functions. It has three methods (z mad, iqr), and different processing of outliers (remove, or replace). Plots are available and hacking (trying different methods or thresholds) are possible

see the example:

set.seed(1234)
x = rnorm(10)
ez.outlier(iris,'Sepal.Length',fill='null',hack=T,cutoff=c(1,2,3),plot=T)

#' univariate outlier cleanup
#' @description univariate outlier cleanup
#' @param x a data frame or a vector
#' @param col colwise processing
#' \cr        col name
#' \cr        if x is not a data frame, col is ignored
#' \cr        could be multiple cols
#' @param method z score, mad, or IQR (John Tukey)
#' @param cutoff abs() > cutoff will be treated as outliers. Default/auto values (i.e. if NA):
#' \cr z 95% of values fall within 1.96, qnorm(0.025,lower.tail=F), or 3
#' \cr mad 2.5, which is the standard recommendation, or 5.2
#' \cr iqr 1.5
#' \cr if multiple values specified, use the first one (an exception is hack=T, during which method and cutoff same length or scalar)
#' @param hack call mapply to try all method and cutoff (same length or scalar, ie, different methods with 
#' corresponding cutoff, or same method with different cutoff).
#' @param plot boxplot and hist before and after outlier processing.
#' @param fillout how to process outlier, fill with na, mean, median (columnwise for data frame), or 
#' null --> remove outlier (only for vector or df with single col specified)
#' @return returns a new data frame or vector. If hack=T, returns nothings
#' @note univariate outlier approach
#' The Z-score method relies on the mean and standard deviation of a group of data to measure central
#' tendency and dispersion. This is troublesome, because the mean and standard deviation are highly
#' affected by outliers – they are not robust. In fact, the skewing that outliers bring is one of the
#' biggest reasons for finding and removing outliers from a dataset!
#' Another drawback of the Z-score method is that it behaves strangely in small datasets – in fact,
#' the Z-score method will never detect an outlier if the dataset has fewer than 12 items in it.
#' \cr
#' \cr
#' Median absolute deviation, modified z-score. The median and MAD are robust measures of central tendency and dispersion, respectively.
#' \cr
#' \cr
#' Interquartile range method is that, like the modified Z-score method, it uses a robust measure of dispersion.
#' \cr
#' @examples
#' set.seed(1234)
#' x = rnorm(10)
#' iris %>% ez.outlier('Sepal.Length',fill='null',hack=T,plot=T)
#' @export
ez.outlier = function(x, col=NULL, method=c('z','mad','iqr'), cutoff=NA, fillout=c('na','null','mean','median'), hack=FALSE, plot=FALSE, na.rm=TRUE, print2scr=TRUE) {
    # https://datascienceplus.com/rscript/outlier.R
    # https://cran.r-project.org/web/packages/outliers/index.html
    # https://rpubs.com/hauselin/outliersDetect

    if (hack==T){
            # here for programming reason, for mapply,
            # cutoff could not be NULL, use NA, because length(NULL)=0, but length(NA)=1
            mapply(ez.outlier,method=method,cutoff=cutoff,MoreArgs=list(x=x,col=col,hack=F,plot=plot,fillout=fillout,na.rm=na.rm,print2scr=print2scr),SIMPLIFY=F,USE.NAMES=F)
            cat('Hack done! No actual data returned.\n')
            return(invisible(NULL))
    }

 method = match.arg(method); fillout =fillout[1]; cutoff=cutoff[1]

    if (!is.data.frame(x)) {
        # todropna is a workaround for data frame with single col passed in

        x.bak.plot = x; x.replace.na = x; oldNAs = sum(is.na(x.replace.na))
        if (fillout=='na' | fillout=='todropna') {
            replacement = NA
        } else if (fillout=='mean') {
            replacement = mean(x, na.rm=na.rm)
        } else if (fillout=='median') {
            replacement = median(x, na.rm=na.rm)
        } else if (fillout=='null') {
            replacement = NULL
        }

        if (method=='z'){
            if(is.na(cutoff)) cutoff = qnorm(0.025,lower.tail=F)
            absz = abs((x - mean(x, na.rm=na.rm))/sd(x, na.rm=na.rm))
            if (!is.null(replacement)) {
                x[absz > cutoff] <- replacement
                } else {
                    # if nothing above cutoff, x is untouched
                    if (length(which(absz > cutoff)) > 0) {
                        x = x[-which(absz > cutoff)]
                    }
                }
            x.replace.na[absz > cutoff] <- NA
        } else if (method=='mad'){
            if(is.na(cutoff)) cutoff = 2.5
            absmad <- abs((x - median(x, na.rm=na.rm))/mad(x, na.rm=na.rm))
            if (!is.null(replacement)) {
                x[absmad > cutoff] <- replacement
                } else {
                    if (length(which(absmad > cutoff)) > 0) {
                        x = x[-which(absmad > cutoff)]
                    }
                }
            x.replace.na[absmad > cutoff] <- NA
        } else if (method=='iqr'){
            # https://stackoverflow.com/a/4788102/2292993
            if(is.na(cutoff)) cutoff = 1.5
            q1 <- quantile(x, 0.25, na.rm=na.rm)
            q3 <- quantile(x, 0.75, na.rm=na.rm)
            # alternatively iqr = q3-q1
            iqr = IQR(x, na.rm = na.rm)
            lower_bound = q1 - (iqr * cutoff)
            upper_bound = q3 + (iqr * cutoff)
            if (!is.null(replacement)) {
                x[(x > upper_bound) | (x < lower_bound)] <- replacement
                } else {
                    if (length(which((x > upper_bound) | (x < lower_bound))) > 0) {
                        x = x[-which((x > upper_bound) | (x < lower_bound))]
                    }
                }
            x.replace.na[(x.replace.na > upper_bound) | (x.replace.na < lower_bound)] <- NA
        }

        newNAs = sum(is.na(x.replace.na)) - oldNAs
        if (print2scr) {
            if (!is.null(col)) {
                cat(sprintf('%-15s %5s(%.2f): %3d outliers found and %s.\n', toString(col), toupper(method), cutoff, newNAs, ifelse((is.null(replacement)|fillout=='todropna'),'REMOVED','REPLACED')))
            } else {
                cat(sprintf('%5s(%.2f): %3d outliers found and %s.\n', toupper(method), cutoff, newNAs, ifelse((is.null(replacement)|fillout=='todropna'),'REMOVED','REPLACED')))
            }
        }

        if (plot){
            # mar controls margin size for individual plot it goes c(bottom, left, top, right)
            # oma is margin for the whole?
            opar = par(mfrow=c(2, 2), oma=c(0,0,1.5,0), mar = c(2,2,1.5,0.5))
            on.exit(par(opar))
            boxplot(x.bak.plot, main=sprintf("With outliers (n=%d)",length(x.bak.plot)))
            hist(x.bak.plot, main=sprintf("With outliers (n=%d)",length(x.bak.plot)), xlab=NULL, ylab=NULL)

            boxplot(x, main=sprintf("With outliers (n=%d)",length(x.bak.plot)-newNAs))
            hist(x, main=sprintf("With outliers (n=%d)",length(x.bak.plot)-newNAs), xlab=NULL, ylab=NULL)
            title(sprintf("%s Outlier Check: %s(%.2f)",toString(col), toupper(method), cutoff), outer=TRUE)
        }
    } else if (is.data.frame(x)) {
        if (length(col)>1 & fillout=='null') {
            cat('I do not know how to remove univariate outliers in multiple cols. fillout: null --> na ...\n')
            fillout='na'
        } else if (fillout=='null') {
            fillout='todropna'
        }
        # trick to pass actual col name
        x[col] = lapply(1:length(col), function(j) {ez.outlier(x=x[col][[j]],col=col[j],method=method,cutoff=cutoff,plot=plot,hack=hack,fillout=fillout,na.rm=na.rm,print2scr=print2scr)})
        if (fillout=='todropna') x=x[complete.cases(x[,col,drop=FALSE]),,drop=FALSE]
    } # end if
    return(invisible(x))
}

calculating the outliers in R

5 Answers5

Linked