I am using the function below to find outliers with a 3*sd cutoff, but the result contains NA values as well as the outliers. There should not be any NA values among the outliers, right?

How do I fix it?

findingoutlier<- function (data, cutoff=3, na.rm=TRUE){
  sd <- sd(data, na.rm=TRUE)
  mean <- mean(data, na.rm=TRUE)
  outliers <- (data[data < mean - cutoff * sd | data > mean + cutoff * sd])
  return (outliers)
}
Tess

3 Answers

This is a fairly subtle outcome of the way that NA comparisons are handled in R.

Suppose you have an NA value in data. Then your criterion

data < mean - cutoff * sd | data > mean + cutoff * sd

evaluates to NA (i.e., we don't know if the unavailable data point is an outlier or not ...)

What do we get if we ask for data[NA]? From ?"[":

When extracting, a numerical, logical or character ‘NA’ index picks an unknown element and so returns ‘NA’ in the corresponding element of a logical, integer, numeric, complex or character result ...

(this is a technical way of saying "NA in, NA out").
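A tiny illustration of that behaviour, using a made-up vector `x` and an arbitrary threshold of 50 (both are assumptions for the demo, not taken from the question):

```r
# Hypothetical data with one missing value
x <- c(1, 2, NA, 100)

# Comparing NA with anything gives NA, so the logical index contains an NA
idx <- x > 50
idx

# Indexing with that vector keeps an NA for the unknown element
x[idx]
```

The NA in the result does not mean that the third element is an outlier; it only reflects that the comparison could not be decided.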

So you should either drop NA values from your input (e.g. with na.omit()) or use

!is.na(data) & (data < mean - cutoff * sd | data > mean + cutoff * sd)

as your criterion.
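Put together, the function might look like the sketch below. The name `find_outliers`, the local names `m` and `s`, and the sample vector are my own choices for illustration; the criterion is the one given above.

```r
# Sketch: same 3*sd rule as in the question, but NA elements are
# explicitly excluded from the logical index.
find_outliers <- function(data, cutoff = 3) {
  m <- mean(data, na.rm = TRUE)
  s <- sd(data, na.rm = TRUE)
  is_outlier <- !is.na(data) &
    (data < m - cutoff * s | data > m + cutoff * s)
  data[is_outlier]
}

# Made-up example: twenty 1s, one NA, and one large value
x <- c(rep(1, 20), NA, 50)
find_outliers(x)
```

Using `m` and `s` rather than `mean` and `sd` for the local variables also avoids shadowing the base functions, which makes the body easier to read. Wrapping the original condition in which() should have a similar effect, since which() drops NA entries from a logical vector.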

I can't think of any other reasons you would end up with NA in output (and since you haven't given a reproducible example I can't guess what they would be ...)

Ben Bolker

You can easily remove the NA values from the result afterwards with:

outliers <- outliers[!is.na(outliers)]

So your function will look like this:

findingoutlier<- function (data, cutoff=3, na.rm=TRUE){
  sd <- sd(data, na.rm=TRUE)
  mean <- mean(data, na.rm=TRUE)
  outliers <- (data[data < mean - cutoff * sd | data > mean + cutoff * sd])
  outliers <- outliers[!is.na(outliers)]
  return (outliers)
}
Aziz

It looks like you're passing a vector of integers as the data argument, which is then filtered by outliers <- (data[data < mean - cutoff * sd | data > mean + cutoff * sd]).

With a silly example set, a <- c(1, 2, 3, 4, 5, 6, 7, 8, 9), this searches for data < -3.215838 | data > 13.21584, which doesn't find a match.
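Those thresholds can be reproduced directly; the variable names `lower` and `upper` are just for this sketch:

```r
a <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)

# 3*sd band around the mean
lower <- mean(a) - 3 * sd(a)
upper <- mean(a) + 3 * sd(a)
c(lower = lower, upper = upper)

# No element falls outside the band, so the result is empty
a[a < lower | a > upper]
```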

I would default to using a package for outliers.

install.packages("outliers")
library(outliers)

values <- c(1, 1, 1, 8, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
outlier(values)
# prints [1] 8

Another option for time series data is Twitter's AnomalyDetection package:

install.packages("devtools")
devtools::install_github("twitter/AnomalyDetection")
library(AnomalyDetection)

values <- c(1, 1, 1, 8, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
dates <- as.POSIXlt(c('2010-3-01', '2010-3-02','2010-3-03', '2010-3-04', '2010-3-05', '2010-3-06', '2010-3-07', '2010-3-08', '2010-3-09', '2010-3-10', '2010-3-11', '2010-3-12', '2010-3-13', '2010-3-14', '2010-3-15', '2010-3-16', '2010-3-17', '2010-3-18'
))
df <- data.frame(dates, values)
res = AnomalyDetectionTs(df, max_anoms=0.02, direction='both', plot=TRUE)
res$anoms
res$plot
#    timestamp anoms
# 1 2010-03-04     8

Lex