
I am new to R and am trying to test for outliers in a large data frame with over 600 variables, all numeric except for the last 2 columns. I used the `outlier` function from the outliers package to test one column at a time, but I ended up with a numeric vector that I could not use. Is there a better way to identify all outliers in a data frame?

myout <- c()
for (i in 1:dim(training)[2]) {
  if (is.numeric(training[, i])) {
    myout <- c(myout, outlier(training[, i]))
  }
}
agstudy
omneya
  • You end up with a vector because you created a vector. What's the problem here? If the last two columns are not numeric, why are you testing them? – Roman Luštrik Mar 05 '13 at 08:32
  • What exactly was wrong with the `outlier` function? For one thing, it can be given a data frame, in which case it tests each column individually... – David Robinson Mar 05 '13 at 08:33
  • Have you read `?outlier`? It returns the value within a vector that is most different from the mean, so a vector of values is exactly what should be expected from your code here. What are you expecting as output? – alexwhan Mar 05 '13 at 09:51
  • You may wish to review http://stackoverflow.com/questions/4787332/how-to-remove-outliers-from-a-dataset – Jack Ryan Mar 05 '13 at 13:24
  • I would be even more emphatic than Jack. You MUST look at the cited answer. If you have data with extreme values, you should leave them in and use appropriate methods unless you are able to identify a reason. Just throwing out "outliers" is a major statistical sin. – IRTFM Mar 07 '13 at 00:28

2 Answers


As you can read in the help file for `outlier`, it returns one value per variable: the one that differs the most from the mean. I think what you want is, for each variable, the indices of all data points that are outliers. This can be done in the following way (of course, you need to remove your non-numeric variables first):

# first write a custom function that returns the index of all outliers
# I define an outlier as 3 sd's away from the mean, you can adjust that

is.outlier <- function(x) which(abs(x - mean(x)) > 3*sd(x))

# turn the df into a list, and apply the function to each variable with lapply

df.as.list <- as.list(df)   # enter the name of your data frame instead of df
lapply(df.as.list, is.outlier)

This returns a list in which element i holds the row indices of the outliers of the variable in column i.
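
To make the "remove your non-numeric variables first" step concrete, here is a minimal sketch on a made-up data frame. The data, the name `numeric.cols`, and the `na.rm = TRUE` arguments are assumptions added for illustration; they are not part of the original answer.

```r
# Hypothetical toy data: two numeric columns and one non-numeric column
training <- data.frame(
  a = c(seq(-2, 2, length.out = 50), 25),   # row 51 is an extreme value
  b = seq(10, 20, length.out = 51),         # evenly spread, no extreme values
  label = factor(rep(c("x", "y", "z"), 17))
)

# Same rule as above: more than 3 standard deviations from the mean.
# na.rm = TRUE is an added assumption so missing values do not yield NA.
is.outlier <- function(x) {
  which(abs(x - mean(x, na.rm = TRUE)) > 3 * sd(x, na.rm = TRUE))
}

# Keep only the numeric columns, then apply the rule column by column
numeric.cols <- training[, sapply(training, is.numeric), drop = FALSE]
out.idx <- lapply(numeric.cols, is.outlier)

out.idx$a   # row indices flagged in column a
out.idx$b   # empty: nothing flagged in column b
```

Because `lapply` iterates over the columns of a data frame directly, the explicit `as.list` conversion is optional here.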

Edwin

You may not actually want to remove outliers, but per this 2 years ago:

x[!x %in% boxplot.stats(x)$out] 
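
As a rough illustration of what that one-liner does: `boxplot.stats(x)$out` holds the values that a box plot would draw as individual points (by default, those more than 1.5 times the hinge spread beyond the hinges). The sample vector and the name `kept` below are made up for this sketch.

```r
# Hypothetical sample with one extreme value
x <- c(2, 3, 3, 4, 5, 5, 6, 7, 100)

# Values a box plot would flag as outliers
flagged <- boxplot.stats(x)$out

# Keep everything that was not flagged
kept <- x[!x %in% flagged]

flagged
kept
```

To get positions rather than values, for example to inspect the rows before deciding anything, `which(x %in% boxplot.stats(x)$out)` gives the indices of the flagged points.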
Jack Ryan