
I'm trying to exclude outliers based on a subset of selected variables, with the aim of performing sensitivity analyses. I've adapted the function available here (calculating the outliers in R), but have been unsuccessful so far (I'm still a novice R user). Please let me know if you have any suggestions!

df <- data.frame(ID = c(1001, 1002, 1003, 1004, 1005,   1006,   1007,   1008,   1009,   1010,   1011),
                 measure1 = rnorm(11, mean = 8, sd = 4),
                 measure2 = rnorm(11, mean = 40, sd = 5),
                 measure3 = rnorm(11, mean = 20, sd = 2),
                 measure4 = rnorm(11, mean = 9, sd = 3))

vars_of_interest <- c("measure1", "measure3", "measure4")

# define a function to remove outliers
FindOutliers <- function(data) {
  lowerq = quantile(data)[2]
  upperq = quantile(data)[4]
  iqr = upperq - lowerq #Or use IQR(data)
  # we identify extreme outliers
  extreme.threshold.upper = (iqr * 3) + upperq
  extreme.threshold.lower = lowerq - (iqr * 3)
  result <- which(data > extreme.threshold.upper | data < extreme.threshold.lower)
}

# use the function to identify outliers
temp <- FindOutliers(df[vars_of_interest])

# remove the outliers
testData <- testData[-temp]

# show the data with the outliers removed
testData
M_Oxford
  • The error happens at `df[cogvars]`, before you even get a chance to run `FindOutliers`. What is in the variable `cogvars`? Your code should work as long as you pass a vector to `FindOutliers`. If you want a code review, post on [codereview](https://codereview.stackexchange.com/) (and optionally ping me with @asac)! – asachet Mar 23 '20 at 12:46
  • Make sure to use `df$col` or `df[["col"]]` to extract a column as a vector from a data.frame. Using `df["col"]` selects the column but returns a data.frame when you want a vector. – asachet Mar 23 '20 at 12:53
  • And `quantile` expects a vector input--it looks like you are trying to send it a data frame? If by `df[cogvars]` you actually mean `df[, vars_of_interest]`, you will need one of the `apply` family of functions to loop over the columns you want. – paqmo Mar 23 '20 at 12:58
  • One more thing--what do you want to do to the outliers? Set them to missing? Remove the columns that contain outliers? What do you want your output to look like? That will be helpful to know. – paqmo Mar 23 '20 at 13:02
  • Thank you for all the useful suggestions. You are correct @paqmo about `df[, vars_of_interest]`, I've modified this in the original question. I want to set them to missing and subsequently exclude them. – M_Oxford Mar 23 '20 at 14:15

1 Answer

Separate the concerns:

  1. Identify outliers in a numeric vector using the IQR method. This can be encapsulated in a function taking a vector.
  2. Remove outliers from several columns of a data.frame. This is a function taking a data.frame.

I would suggest returning a boolean vector rather than indices. This way, the returned value has the same length as the data, which makes it easy to create a new column, for example `df$outlier <- is_outlier(df$measure1)`.
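To illustrate why a logical vector is the safer return type, here is a minimal sketch. The vector `x` and the threshold rule are made up for illustration and are not part of the answer's functions. It shows a common pitfall of index-based removal: when nothing is flagged, negative indexing with `which()` silently drops every element, whereas logical indexing leaves the data intact.

```r
# Toy data and a stand-in rule, both hypothetical, for illustration only
x <- c(5, 7, 6, 8, 7)
flagged <- x > 100            # logical vector, same length as x; nothing is flagged

# Index-based removal: which() returns integer(0), and x[-integer(0)] is empty
by_index <- x[-which(flagged)]
length(by_index)              # 0: all data lost

# Logical removal: negate the flags and keep everything that is not flagged
by_flag <- x[!flagged]
length(by_flag)               # 5: data kept

# A logical vector also slots straight into a data.frame as a new column
d <- data.frame(value = x)
d$outlier <- flagged
```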

Note how the argument names make it clear which type of input is expected: `x` is a standard name for a numeric vector and `df` is obviously a data.frame. `cols` is probably a list or vector of column names.

I made a point to only use base R but in real life I would use the dplyr package to manipulate the data.frame.

#' Detect outliers using IQR method
#' 
#' @param x A numeric vector
#' @param na.rm Whether to exclude NAs when computing quantiles
#' 
is_outlier <- function(x, na.rm = FALSE) {
  qs <- quantile(x, probs = c(0.25, 0.75), na.rm = na.rm)

  lowerq <- qs[1]
  upperq <- qs[2]
  iqr <- upperq - lowerq

  extreme.threshold.upper <- (iqr * 3) + upperq
  extreme.threshold.lower <- lowerq - (iqr * 3)

  # Return logical vector
  x > extreme.threshold.upper | x < extreme.threshold.lower
}

#' Remove rows with outliers in given columns
#' 
#' Any row with at least 1 outlier will be removed
#' 
#' @param df A data.frame
#' @param cols Names of the columns of interest. Defaults to all columns.
#' 
#' 
remove_outliers <- function(df, cols = names(df)) {
  for (col in cols) {
    cat("Removing outliers in column: ", col, " \n")
    df <- df[!is_outlier(df[[col]]),]
  }
  df
}
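
As a quick sanity check of the rule, the sketch below runs the same 3 × IQR test on a small hand-made vector. The helper is repeated in compact form so the snippet runs on its own; the vector `x` is made up for illustration.

```r
# Compact copy of is_outlier(), repeated so this snippet is standalone
is_outlier <- function(x, na.rm = FALSE) {
  qs <- quantile(x, probs = c(0.25, 0.75), na.rm = na.rm)
  iqr <- qs[2] - qs[1]
  x > qs[2] + 3 * iqr | x < qs[1] - 3 * iqr
}

# Hypothetical data: ten well-behaved values and one extreme value
x <- c(1:10, 100)

# Quartiles are 3.5 and 8.5, so the IQR is 5 and the fences are -11.5 and 23.5
flags <- is_outlier(x)
which(flags)   # 11: only the value 100 is flagged
x[!flags]      # the ten remaining values
```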

Armed with these 2 functions, it becomes very easy:

df <- data.frame(ID = c(1001, 1002, 1003, 1004, 1005,   1006,   1007,   1008,   1009,   1010,   1011),
                 measure1 = rnorm(11, mean = 8, sd = 4),
                 measure2 = rnorm(11, mean = 40, sd = 5),
                 measure3 = rnorm(11, mean = 20, sd = 2),
                 measure4 = rnorm(11, mean = 9, sd = 3))

vars_of_interest <- c("measure1", "measure3", "measure4")


df_filtered <- remove_outliers(df, vars_of_interest)
#> Removing outliers in column:  measure1  
#> Removing outliers in column:  measure3  
#> Removing outliers in column:  measure4

df_filtered
#>      ID  measure1 measure2 measure3   measure4
#> 1  1001  9.127817 40.10590 17.69416  8.6031175
#> 2  1002 18.196182 38.50589 23.65251  7.8630485
#> 3  1003 10.537458 37.97222 21.83248  6.0798316
#> 4  1004  5.590463 46.83458 21.75404  6.9589981
#> 5  1005 14.079801 38.47557 20.93920 -0.6370596
#> 6  1006  3.830089 37.19281 19.56507  6.2165156
#> 7  1007 14.644766 37.09235 19.78774 10.5133674
#> 8  1008  5.462400 41.02952 20.14375 13.5247993
#> 9  1009  5.215756 37.65319 22.23384  7.3131715
#> 10 1010 14.518045 48.97977 20.33128  9.9482211
#> 11 1011  1.594353 44.09224 21.32434 11.1561089

Created on 2020-03-23 by the reprex package (v0.3.0)
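
The asker's follow-up comment mentions setting outliers to missing before excluding them. A minimal base R sketch of that variant follows, under some assumptions: the helper name `mask_outliers` and the example data are hypothetical, and `is_outlier` is repeated in compact form so the snippet runs on its own. Unlike the loop in `remove_outliers`, which recomputes the quartiles on a data.frame that shrinks after each column, this version flags each column against all of its observed values.

```r
# Compact copy of is_outlier(), repeated so this snippet is standalone
is_outlier <- function(x, na.rm = FALSE) {
  qs <- quantile(x, probs = c(0.25, 0.75), na.rm = na.rm)
  iqr <- qs[2] - qs[1]
  x > qs[2] + 3 * iqr | x < qs[1] - 3 * iqr
}

# Hypothetical helper: replace outliers in the chosen columns with NA
mask_outliers <- function(df, cols = names(df)) {
  for (col in cols) {
    flags <- is_outlier(df[[col]], na.rm = TRUE)
    df[[col]][which(flags)] <- NA   # which() skips values that were already NA
  }
  df
}

# Made-up data: measure1 has one extreme value in row 6
df <- data.frame(ID = 1:6,
                 measure1 = c(8, 9, 7, 10, 8, 90),
                 measure2 = c(40, 41, 39, 42, 38, 40))

df_masked <- mask_outliers(df, "measure1")

# Sensitivity analysis: keep only rows with no missing value in the masked column
df_complete <- df_masked[complete.cases(df_masked["measure1"]), ]
```

Keeping the masked data.frame around makes it possible to compare results with and without the flagged values, which is usually the point of a sensitivity analysis.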

asachet