OK, so admittedly this is related to another question of mine, but that one has had no response, and I suspect it is because I made it too complex. So I'm asking this simplified version instead. Happy to be scolded if this is not acceptable.

My core problem is that I want to create a dataframe containing only the outliers from each column. The dataframe looks like:

 chr   leftPos         TBGGT     12_try      324Gtt       AMN2
  1     24352           34         43          19         43
  1     53534           2          1           -1         -9
  2      34            -15         7           -9         -18
  3     3443           -100        -4          4          -9
  3     3445           -100        -1          6          -1
  3     3667            5          -5          9           5
  3     7882           -8          -9          1           3

I would like to calculate the upper and lower limits of each column (from the third onwards) and exclude all rows that fall within the limits, so that I only keep outliers. Each resulting dataframe then gets passed to the next bit of the code (in the loop), which I won't elaborate on for the sake of simplicity. For each column, the result should look like this:

chr   leftPos         TBGGT
 2      34            -15        
 3     3443           -100       
 3     3445           -100  

My code so far:

alpha= 1.5

 f1 <- function(df, ZCol){

  # Determine the UL and LL and then generate the Zoutliers
  UL = median(ZCol, na.rm = TRUE) + alpha*IQR(ZCol, na.rm = TRUE)
  LL = median(ZCol, na.rm = TRUE) - alpha*IQR(ZCol, na.rm = TRUE)
  Zoutliers <- which(ZCol > UL | ZCol < LL)}

But this just gives me the outliers, without the chr and leftPos they are associated with. How do I get those?

Sebastian Zeki
  • Have you considered simple subsetting by applying your rules of UL and LL as limiting factors? E.g. `df[df$x > UL & df$x < LL,]` If I understand the question correctly that is... – statespace Apr 16 '15 at 07:36
  • I don't understand this question. Do you expect 4 data.frames as output? Why do you load data.table? Where is `alpha` defined? Couldn't you just use `x %in% boxplot.stats(x)$out` instead of your function? – Roland Apr 16 '15 at 07:43
  • I have included alpha in the code above. Yes, I would expect 4 data frames as a result. I have removed data.table as it's not relevant for this part of the code. Thanks for pointing it out – Sebastian Zeki Apr 16 '15 at 08:07
  • If you need `alpha` within the function, make it a function parameter. – Roland Apr 16 '15 at 08:23

2 Answers

Maybe this:

DF <- read.table(text=" chr   leftPos         TBGGT     12_try      324Gtt       AMN2
  1     24352           34         43          19         43
  1     53534           2          1           -1         -9
  2      34            -15         7           -9         -18
  3     3443           -100        -4          4          -9
  3     3445           -100        -1          6          -1
  3     3667            5          -5          9           5
  3     7882           -8          -9          1           3", header = TRUE)

#fix your function as explained by @Thilo
#also make alpha a parameter with default value
f1 <- function(ZCol, alpha = 1.5){  
  UL <- median(ZCol, na.rm = TRUE) + alpha*IQR(ZCol, na.rm = TRUE)
  LL <- median(ZCol, na.rm = TRUE) - alpha*IQR(ZCol, na.rm = TRUE)
  ZCol > UL | ZCol < LL
}

#loop over the columns and subset with the function's logical return values
outlist <- lapply(3:6, function(i, df) {
  df[f1(df[,i]), c(1:2, i)]  
}, df = DF)


#[[1]]
#  chr leftPos TBGGT
#4   3    3443  -100
#5   3    3445  -100
#
#[[2]]
#  chr leftPos X12_try
#1   1   24352      43
#
#[[3]]
#  chr leftPos X324Gtt
#1   1   24352      19
#3   2      34      -9
#
#[[4]]
#  chr leftPos AMN2
#1   1   24352   43
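
To pull a single result back out of outlist, something like the following sketch may help. It recreates DF, f1 and outlist from above so that it runs on its own. Single brackets return a one-element list, while double brackets return the data frame itself. Naming the list elements and the variable tbggt_outliers are optional conveniences assumed here for illustration, not part of the answer above.

```r
# Recreate the example data and helper from above so the sketch is self-contained.
DF <- read.table(text = " chr   leftPos   TBGGT   12_try   324Gtt   AMN2
  1     24352    34     43    19    43
  1     53534     2      1    -1    -9
  2        34   -15      7    -9   -18
  3      3443  -100     -4     4    -9
  3      3445  -100     -1     6    -1
  3      3667     5     -5     9     5
  3      7882    -8     -9     1     3", header = TRUE)

f1 <- function(ZCol, alpha = 1.5) {
  UL <- median(ZCol, na.rm = TRUE) + alpha * IQR(ZCol, na.rm = TRUE)
  LL <- median(ZCol, na.rm = TRUE) - alpha * IQR(ZCol, na.rm = TRUE)
  ZCol > UL | ZCol < LL
}

outlist <- lapply(3:6, function(i, df) df[f1(df[, i]), c(1:2, i)], df = DF)

# outlist[1] is still a list of length 1; outlist[[1]] is the data frame inside it.
class(outlist[1])    # "list"
class(outlist[[1]])  # "data.frame"

# Optional: name the elements after their columns so they can be looked up by name.
names(outlist) <- names(DF)[3:6]
tbggt_outliers <- outlist[["TBGGT"]]
tbggt_outliers
```

Note that read.table prepends an X to column names that start with a digit, so the second and third elements end up named X12_try and X324Gtt.
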
Roland
  • So how do I then reference the dataframe for TBGGT, for example, as you have shown above? outlist[1]? That doesn't seem to work for me. – Sebastian Zeki Apr 17 '15 at 06:32
You basically provided the answer yourself; you're just missing the final link.

Your function computes the limits you define for outliers. We change the result such that it returns a boolean vector that is true if the value is an outlier:

isOutlier <- function(values) {
  # Determine the UL and LL
  UL <- median(values, na.rm = TRUE) + alpha*IQR(values, na.rm = TRUE)
  LL <- median(values, na.rm = TRUE) - alpha*IQR(values, na.rm = TRUE)
  values > UL | values < LL  # Return a boolean vector that can be used as a filter later on. 
}
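
As a worked illustration of what the function computes, here is a small sketch using the TBGGT values from the question. The variable names (tbggt, med, spread, is_out) are chosen here for illustration only, and alpha is set to 1.5 as in the question.

```r
# TBGGT column from the question's example data.
tbggt <- c(34, 2, -15, -100, -100, 5, -8)
alpha <- 1.5  # same multiplier as in the question

med    <- median(tbggt)      # -8
spread <- IQR(tbggt)         # 61 with R's default quantile method
UL <- med + alpha * spread   # 83.5
LL <- med - alpha * spread   # -99.5

# Logical vector: TRUE where a value lies outside [LL, UL].
is_out <- tbggt > UL | tbggt < LL
which(is_out)   # 4 5, i.e. the two -100 entries
```

With these limits, -15 lies inside the interval, so only the two -100 rows are flagged for this column.
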

Now you can subset your data frame simply using this function, i.e.

AMN2.outliers <- subset(df, isOutlier(AMN2))

or

AMN2.outliers <- df[isOutlier(df$AMN2), ]

whichever suits you better. Of course, you could also wrap this line in the function, but for readability I prefer the solution above.
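
If you do want the subsetting wrapped up, one possible sketch follows. The helper name outliersFor, its arguments, and the choice to pass alpha explicitly are assumptions made for illustration, not part of the answer above. The identifier columns default to the first two, as in the question's data.

```r
# Hypothetical helper: returns the id columns plus one value column,
# keeping only the rows where that value is an outlier.
outliersFor <- function(df, colName, alpha = 1.5, idCols = 1:2) {
  values <- df[[colName]]
  centre <- median(values, na.rm = TRUE)
  spread <- alpha * IQR(values, na.rm = TRUE)
  keep <- values > centre + spread | values < centre - spread
  # which() skips NA comparisons, so rows with missing values are not returned.
  df[which(keep), c(names(df)[idCols], colName), drop = FALSE]
}

# Example data: chr, leftPos and AMN2 from the question.
df <- data.frame(
  chr     = c(1, 1, 2, 3, 3, 3, 3),
  leftPos = c(24352, 53534, 34, 3443, 3445, 3667, 7882),
  AMN2    = c(43, -9, -18, -9, -1, 5, 3)
)

AMN2.outliers <- outliersFor(df, "AMN2")
AMN2.outliers
```

Passing the column by name keeps chr and leftPos attached to each outlier, which is what the question asks for.
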

Besides: I would suggest using the <- operator for assignment instead of =. See here.

Thilo