Find outliers by Standard Deviation from mean, replace with NA in large dataset (6000+ columns)

Question

I've found a few answers similar to what I'm looking for, but the code has not worked.

I need to replace all outliers in each column (defined, for our purposes, as values more than 2 SD away from the column mean) with NA or NULL. I want this to iterate across all columns, calculating the outliers for each column and replacing those values.

Disclosure: I also know statistically there are strong views on the removal of outliers, and the use of SD as a measure to exclude them given that SD is calculated including the outliers in this case. These were my instructions from our statistician, so I am working off that for now.

This is a glimpse of my dataset:

data

Group sp.Q13813.SPTN1_HUMAN sp.O14773.TPP1_HUMAN sp.P11137.MTAP2_HUMAN
1 Premutation           10713983468            367492324            2134747097
2 Premutation           10789498495            343303410            2677825476
3 Premutation           11134883489            383589325            2132552280
4 Premutation            9723552595            269965000            2262740921
5 Premutation           11175156282            359864993            1419225650
6 Premutation           10959077349            258095035            3343267633
7 Premutation           10770809133            331554977            2763604046
8 Premutation           11098182537            344384433            2198718886

I've tried code that uses the scale() function, but when I ran the code below I found that it replaced values that were within 2 SD of the column mean, and it left my data in the scaled format. I'm not very familiar with this function and was not sure how to convert the result back to the original data points, even if it had been replacing the correct values.

Method attempt #1

# take note of order for column names
data.names_P <- colnames(data)


# scale all numeric columns
data.numeric.Pre <- select_if(data, is.numeric) %>% # subset of numeric columns
  mutate_all(scale)                             # perform scale separately for each column
data.numeric.Pre[data.numeric.Pre > 2] <- 99999 # set values larger than 2 to NA (none in this example)

# combine results with subset data frame of non-numeric columns
data.Pre <- data.frame(select_if(data, function(x) !is.numeric(x)),
                   data.numeric)

# restore columns to original order
data.Pre <- data.Pre[, data.names_P]

Method attempt #2


FindOutliers <- function(data) {
  upper = (2*sd(data) + mean(data)
  lower = (mean - 2*sd(data))
  result <- which(data < upper | data > lower)
}

I know attempt number two does not replace the data with NA.

Any help would be immensely appreciated.

score 2 · Accepted Answer · answered Jan 03 '20 at 01:13

You can use the ifelse() function. Here is an example using dplyr that applies ifelse() to every column whose name contains "HUMAN":

library(dplyr)
# Replace values more than 2 SD away from the column mean with NA
data %>% mutate_at(.vars = vars(contains("HUMAN")),
                   .funs = ~ifelse(abs(. - mean(.)) > 2 * sd(.), NA, .))
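
For anyone who wants to check the 2 SD rule on a small example without dplyr, here is a minimal base-R sketch. The toy data frame, its column names, and the helper `mask_outliers()` are hypothetical and only serve as illustration; the column selection by the "HUMAN" suffix mirrors the idea above.

```r
# Hypothetical toy data; column names are made up for illustration.
df <- data.frame(
  Group       = rep("Premutation", 8),
  protA_HUMAN = c(10, 11, 9, 10, 12, 10, 11, 50),
  protB_HUMAN = c(5, 6, 5, 4, 5, 6, 5, 5)
)

# Hypothetical helper: set values more than n_sd standard deviations
# away from the mean of x to NA, and return x otherwise unchanged.
mask_outliers <- function(x, n_sd = 2) {
  centre <- mean(x, na.rm = TRUE)
  spread <- sd(x, na.rm = TRUE)
  # which() drops NA comparisons, so existing NAs are left alone
  x[which(abs(x - centre) > n_sd * spread)] <- NA
  x
}

# Select the columns whose names contain "HUMAN" and process each one
human_cols <- grepl("HUMAN", names(df))
df[human_cols] <- lapply(df[human_cols], mask_outliers)
df
```

In this toy data only the value 50 in protA_HUMAN lies more than 2 SD from its column mean, so it is the only value replaced. Because each column is handled independently by `lapply()`, the same pattern should carry over to a data frame with thousands of columns.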

dc37

selecting for variables containing "HUMAN" was not a function I was aware of, this could work too. Thank you! – norcanich Jan 03 '20 at 17:21
Do you know how to adjust this for a case where the iteration needs to run across a row instead of down a column? I know how to adjust an apply() function for rows instead of columns, but am stuck with a dplyr mutate function, if it's even possible. – norcanich Mar 16 '20 at 17:38
You can probably transpose your dataframe (inverting rows and columns) and then apply `dplyr` functions on it. Without data, it's hard to be sure of the solution – dc37 Mar 16 '20 at 18:46
I was having trouble with the normal transpose function for this - it was truncating my columns to only 50 when I have >30k rows that need to be inverted to columns...but was able to use the WGCNA package and transposeBigData() function to make your above recommendation work. Thanks! – norcanich Mar 17 '20 at 02:03
You're welcome. I did not know `transposeBigData` function. Thanks for the tip ;) – dc37 Mar 17 '20 at 02:07
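
As an alternative to transposing, the row-wise case discussed in these comments can be sketched in base R on a numeric matrix. This is only an illustration under assumptions: the toy matrix `m` and its row names are hypothetical, and a real data frame would first need its numeric columns converted, for example with `as.matrix()`.

```r
# Hypothetical toy matrix: each row gets its own mean and SD cutoff.
m <- rbind(
  a = c(10, 11, 9, 10, 12, 10, 11, 50),
  b = c(5, 6, 5, 4, 5, 6, 5, 5)
)

row_mean <- rowMeans(m, na.rm = TRUE)
row_sd   <- apply(m, 1, sd, na.rm = TRUE)

# A vector of length nrow(m) is recycled down each column, so element
# [i, j] is compared against the mean and SD of row i.
m[which(abs(m - row_mean) > 2 * row_sd)] <- NA
m
```

Here only the value 50 in row "a" is more than 2 SD from its row mean, so it is the only entry set to NA.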

score 2 · Answer 2 · answered Jan 03 '20 at 01:16

We can adapt the function from your second attempt so that it replaces the outliers with NA:

FindOutliers <- function(data) {
   mean_data <- mean(data, na.rm = TRUE)
   sd_data <- sd(data, na.rm = TRUE)
   upper <- mean_data + 2 * sd_data
   lower <- mean_data - 2 * sd_data
   replace(data, data > upper | data < lower, NA)
}

library(dplyr)
data %>%  mutate_if(is.numeric, FindOutliers)
#Tested it via
#mtcars %>%  mutate_if(is.numeric, FindOutliers)
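
Regarding the scale() attempt in the question, here is a small sketch of how scale() relates to the 2 SD rule. The toy vector is hypothetical. The idea is to use the z-scores only as a mask on the original values, so the data never stay in scaled form; the centre and scale that scale() used are stored as attributes and can be used to back-transform if needed.

```r
# Hypothetical toy vector with one extreme value.
x <- c(10, 11, 9, 10, 12, 10, 11, 50)

# scale() returns a one-column matrix of (x - mean(x)) / sd(x) and keeps
# the mean and SD in the "scaled:center" and "scaled:scale" attributes.
z <- scale(x)

# Use the z-scores only to decide which original values to blank out.
x_clean <- replace(x, which(abs(z) > 2), NA)

# Back-transform the scaled values to the original units.
x_back <- as.vector(z) * attr(z, "scaled:scale") + attr(z, "scaled:center")

x_clean
all.equal(x_back, x)
```

Taking `abs(z)` matters: comparing `z > 2` alone would only flag values above the mean and miss values more than 2 SD below it.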
