
I am trying to use Tukey's fences (the 1.5 × IQR rule) to calculate the average of each row of a data frame, excluding the outliers.

df <- data.frame(matrix(rnorm(1000000), ncol = 10))

averaging_wo_outliers <- function(x) {
    # Tukey's fences: values more than 1.5 * IQR beyond the quartiles are outliers
    q_result <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
    lowerq <- q_result[1]
    upperq <- q_result[2]
    iqr <- upperq - lowerq
    threshold_upper <- upperq + (iqr * 1.5)
    threshold_lower <- lowerq - (iqr * 1.5)
    # average only the values that fall inside the fences
    mean(x[x >= threshold_lower & x <= threshold_upper])
}

result <- apply(df, 1, averaging_wo_outliers)

Now this is pretty slow. Taking a similar approach to this answer, I have been trying to make it faster by vectorizing. Is it even possible to make this task faster? Also, if it is not vectorizable (if that is a word!), do you think dplyr or data.table might help, or should I not expect any improvement from those packages? Thanks.
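To make concrete the kind of vectorization I have in mind, here is a sketch that computes all the row quantiles at once with `rowQuantiles()` from the matrixStats package (not base R), masks the outliers, and then takes `rowMeans()`. I haven't benchmarked it, so treat it as a sketch rather than a proven speedup:

library(matrixStats)  # for rowQuantiles(); not part of base R

m <- as.matrix(df)                          # all columns are numeric, so a matrix is natural
q <- rowQuantiles(m, probs = c(0.25, 0.75), na.rm = TRUE)
bw <- 1.5 * (q[, 2] - q[, 1])               # 1.5 * IQR for every row at once
m[m < q[, 1] - bw | m > q[, 2] + bw] <- NA  # length-nrow vectors recycle down columns, i.e. per row
result_vec <- rowMeans(m, na.rm = TRUE)     # mean of the values left in each row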

    If you have 100% numeric data you should use a matrix instead of a data frame. Much of that time is spent converting your data frame to a matrix for `apply` – Rich Scriven Jan 02 '17 at 21:40
    There is a built in function `IQR`. Have you tried that? – Sotos Jan 02 '17 at 21:43
  • Are you sure you want to average each row rather than each column of the data frame? Quantiles are run column-wise. – Parfait Jan 03 '17 at 00:20
  • @RichScriven I tried `df <- matrix(rnorm(5000000), ncol = 10)` vs `df <- data.frame(matrix(rnorm(5000000), ncol = 10))` and got `70 seconds` on both! I am not sure why it doesn't make any difference. – ahoosh Jan 03 '17 at 00:27
  • @Parfait Yes, I looked into it. But I still need `lowerq` and `upperq` in addition to `iqr` for my calculations. – ahoosh Jan 03 '17 at 00:28
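
Update: following the comments, here is a data.table sketch of the grouped alternative I was asking about: melt to long format, then compute the trimmed mean per row id. Whether this actually beats `apply` is an open question to me and would need benchmarking:

library(data.table)

dt <- as.data.table(df)
dt[, row_id := .I]                    # tag each original row
long <- melt(dt, id.vars = "row_id")  # one (row_id, variable, value) triple per cell
result_dt <- long[, {
    q <- quantile(value, probs = c(0.25, 0.75), na.rm = TRUE)
    bw <- 1.5 * (q[2] - q[1])
    mean(value[value >= q[1] - bw & value <= q[2] + bw])
}, keyby = row_id]$V1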
