1

I am trying to replace the "outliers" in each column of a dataframe with Nth percentile.

n <- 1000
set.seed(1234)
df <- data.frame(a=runif(n), b=rnorm(n), c=rpois(n,1))
df.t1 <- as.data.frame(lapply(df, function(x) { q <- quantile(x,.9,names=F); x[x>q] <- q; x }))

I need the computed quantiles to truncate other dataframes. For example, I compute these quantiles on a training dataset and apply it; I want to use those same thresholds in several test datasets. Here's an alternative approach which allows that.

q.df <- sapply(df, function(x) quantile(x,.9,names=F))
df.tmp <- rbind(q.df, df.t1)
df.t2 <- as.data.frame(lapply(df.tmp, function(x) { x[x>x[1]] <- x[1]; x }))
df.t2 <- df.t2[-1,]
rownames(df.t2) <- NULL
identical(df.t1, df.t2)

The dataframes are very large and hence I would prefer not to use rbind, and then delete the row later. Is is possible to truncate the columns in the dataframes using the q.df but without having to rbind? Thx.

smci
  • 32,567
  • 20
  • 113
  • 146
ironv
  • 978
  • 10
  • 25
  • Please use `set.seed(...)` to make your random example reproducible. – smci Apr 27 '14 at 03:28
  • Also, you only want to clip outliers on the high side, not the low side. So clip the 0.9 quantile, not the 0.1 quantile. – smci Apr 27 '14 at 03:39

1 Answers1

3

So just write a function that directly computes the quantile, then directly applies clipping to each column. The <- conditional assignment inside your lapply call is bogus; you want ifelse to return a vectorized expression for the entire column, already. ifelse is your friend, for vectorization.

# Make up some dummy df2 output (it's supposed to have 1000 cols really)
df2 <- data.frame(d=runif(1000), e=rnorm(1000), f=runif(1000))

require(plyr)
print(colwise(summary)(df2))  # show the summary before we clamp...

# Compute quantiles on df1...
df1 <- df
df1.quantiles <- apply(df1, 2, function(x, prob=0.9) { quantile(x, prob, names=F) })

# ...now clamp by sweeping col-index across both quantile vector, and df2 cols
clamp <- function(x, xmax) { ifelse(x<=xmax, x, xmax) }
for (j in 1:ncol(df2)) {
  df2[,j] <- clamp(df2[,j], df1.quantiles[j])  # don't know how to use apply(...,2,)
}

print(colwise(summary)(df2))  # show the summary after we clamp...

Reference: [1] "Clip values between a minimum and maximum allowed value in R"

Community
  • 1
  • 1
smci
  • 32,567
  • 20
  • 113
  • 146
  • The dataframes that I am using in my code have ~1000columns. So, listing each of them using df$a would not be possible. As I mentioned in my original writeup, I would like to compute the quantiles for the training dataframe, apply it to that df, and apply those same limits to several other test dfs. Your first approach computes the quantiles for each df separately, which is only a part of what I want. I will not be able to re-use those values for the other dfs. – ironv Apr 27 '14 at 16:36
  • @ironv: for the sake of clarity, post code that actually mirrors what you're trying to do. I've already spent time giving you a functional and performant solution to your code as written. You're just as capable of adapting it as I am. Actually you're more capable, since you understand your use case better. – smci Apr 27 '14 at 22:34
  • @ironv: ok I generalized it to sweep across all col-indices of both quantile vector df1.quantiles, and cols of df2. You can adapt this wo whatever you need. For passing in lots of dataframes df2,3,4... try either `lapply` or `do.call`. – smci Apr 28 '14 at 06:08