
I have a list of 11 data frames, each with the same structure of 44 variables. One of the variables is a ratio, and I am trying to trim records that contain outliers. I have been able to come up with upper and lower bounds using the following code.

First, I created a list of quantiles for each data frame:

quartiles <- lapply(class203_in, function(x) {
    quartiles <- quantile(x$mv_ratio, type=6)
    })

Next, I broke out the first and third quartile:

q1 <- lapply(quartiles, function(x) {
    q1 <- x[2]
    })

# create list of third quartile
q3 <- lapply(quartiles, function(x) {
    q3 <- x[4]
    })

Then I calculated the IQR:

iqr <- lapply(class203_in, function(x) {
    iqr <- IQR(x$mv_ratio, type=6)
    })

And finally came up with upper and lower bounds:

lower <- mapply(function(x, y) x - (y * 1.5), q1, iqr)
upper <- mapply(function(x, y) (y * 1.5) + x, q3, iqr)

The results look like this (the upper bound has exactly the same structure and names for each object in the list):

> lower
$`Yr02.25%`
[1] 0.1885

$`Yr03.25%`
[1] 0.2245

$`Yr04.25%`
[1] 0.2005

$`Yr05.25%`
[1] 0.1795

$`Yr06.25%`
[1] 0.2315

$`Yr07.25%`
[1] 0.127

$`Yr08.25%`
[1] 0.06125

$`Yr09.25%`
[1] 0.0365

$`Yr10.25%`
[1] -0.29725

$`Yr11.25%`
[1] -0.2985

$`Yr12.25%`
[1] -0.1045

I'm now trying to use these two lists to trim the outliers in my main list of data frames, where mv_ratio is the variable I'm trying to trim on. I've gotten close, but I can't seem to get it to kick out an exact replica of the data frames, in a list or otherwise. Here's the code that got me closest:

class203_out <- mapply(function(x, y, z) x <- x[which(x$mv_ratio > y &
     x$mv_ratio < z),], class203_in, lower, upper)   

class203_in is the list of data frames. When I run this, I get a huge matrix.

Any help or push in the right direction would be greatly appreciated.

Jason Grotto
  • Possible duplicate of [Label or score outliers in R](http://stackoverflow.com/questions/32870703/label-or-score-outliers-in-r) – alexwhitworth Nov 17 '15 at 18:01
  • The answer in the above possible duplicate question should solve your problem... you should be able to easily modify that answer's code slightly if you need a slightly customized solution. – alexwhitworth Nov 17 '15 at 18:02

1 Answer


Since you're only dealing with one list, and your problem seems straightforward, I would recommend using the doParallel package to do a foreach (parallelization possible if you wish, but default is sequential).

Also I recommend using data.table for everything, just because.

library(doParallel)
library(data.table)

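subsetted_df_list <- foreach(i = seq_along(df_list)) %do% {
  x <- setDT(df_list[[i]])              # convert to data.table by reference
  q <- quantile(x$mv_ratio, type = 6)   # 0%, 25%, 50%, 75%, 100%
  iqr <- IQR(x$mv_ratio, type = 6)      # interquartile range, same quantile type
  lower <- q[2] - iqr * 1.5             # lower fence: Q1 - 1.5 * IQR
  upper <- q[4] + iqr * 1.5             # upper fence: Q3 + 1.5 * IQR
  x[mv_ratio < upper & lower < mv_ratio]  # keep rows strictly inside the fences
}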

This will return a list of the subsetted data frames from the original list, called here df_list.
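If you would rather stay in base R, here is a minimal sketch of the same trimming with `lapply`. The toy `class203_in` and the helper name `trim_outliers` are made up for illustration; the real list has 11 data frames with 44 columns. The "huge matrix" in the question most likely comes from `mapply` simplifying its result by default, so the last lines also show the original call with `SIMPLIFY = FALSE`.

```r
# Toy stand-in for class203_in (hypothetical data): two small data frames,
# each with one obvious outlier in mv_ratio.
class203_in <- list(
  Yr02 = data.frame(id = 1:20, mv_ratio = c(seq(0.1, 1.9, by = 0.1), 50)),
  Yr03 = data.frame(id = 1:20, mv_ratio = c(-50, seq(0.1, 1.9, by = 0.1)))
)

# Hypothetical helper: keep rows strictly inside the k * IQR fences.
# Assumes mv_ratio has no NA values (quantile() would stop on them).
trim_outliers <- function(df, k = 1.5) {
  q <- quantile(df$mv_ratio, probs = c(0.25, 0.75), type = 6, names = FALSE)
  fence <- k * (q[2] - q[1])            # q[2] - q[1] is the IQR
  keep <- df$mv_ratio > q[1] - fence & df$mv_ratio < q[2] + fence
  df[keep, , drop = FALSE]              # all columns are preserved
}

# lapply always returns a list, one trimmed data frame per input element.
class203_out <- lapply(class203_in, trim_outliers)

# Alternative: reuse precomputed bounds, but stop mapply from simplifying.
lower <- lapply(class203_in, function(x)
  quantile(x$mv_ratio, 0.25, type = 6) - 1.5 * IQR(x$mv_ratio, type = 6))
upper <- lapply(class203_in, function(x)
  quantile(x$mv_ratio, 0.75, type = 6) + 1.5 * IQR(x$mv_ratio, type = 6))
class203_out2 <- mapply(function(x, y, z) x[x$mv_ratio > y & x$mv_ratio < z, ],
                        class203_in, lower, upper, SIMPLIFY = FALSE)
```

Both results are named lists of data frames with the same columns as the inputs. `Map()` would behave the same way as `mapply(..., SIMPLIFY = FALSE)`.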

mlegge