
I want to write a function that takes a data.frame as input and returns a new data.frame in which outliers have been replaced using the tsclean() function from the forecast package.

For the example input df (containing obvious outliers):

df <- data.frame(col1 = runif(24, 400, 700),
                 col2 = runif(24, 350, 600),
                 col3 = runif(24, 600, 940),
                 col4 = runif(24, 2000, 2600),
                 col5 = runif(24, 950, 1200))

colnames(df) <- c("2to2", "2to6", "17to9", "20to31", "90to90")
# Inject obvious outliers
df$`2to2`[12] <- 10000
df$`17to9`[20] <- 6000
df$`20to31`[8] <- 12000

I've been trying to solve this as follows:

clean_ts <- function(df, frequency = 12, start = c(2014, 1), end = c(2015, 12)) {

  ts <- ts(df, frequency = frequency, start = start, end = end)
  results <- list()

  for (i in 1:ncol(ts)) {
    clean <- as.data.frame(tsclean(ts[,i]))
    results[[i]] <- as.data.frame(cbind(clean))
  }
  return(results)
}

I know this is wrong. Instead of returning a list I want my function to return a data.frame with the same dimensions and column names as my input data.frame. I just want the columns of the data.frame() replaced according to the tsclean() function. So from the example my output would have the following form:

2to2  2to6  17to9  20to31  90to90
 .     .     .       .       .
 .     .     .       .       .
Rick Arko
This may be of some use to you as well: http://stackoverflow.com/questions/12866189/calculating-the-outliers-in-r The idea is to create a function that takes a data frame, computes the quantiles and the upper and lower thresholds, and filters out the values outside that range. – InfiniteFlash Mar 08 '16 at 08:07

1 Answer


Your problem is that you're trying to make every column a data frame when assigning it to the list. This is unnecessary. We can also avoid the initialize-to-list-and-cbind workflow by just overwriting the columns in the df object one at a time.

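library(forecast)  # provides tsclean()

clean_ts <- function(df, frequency = 12, start = c(2014, 1), end = c(2015, 12)) {

  # Convert the whole data frame to a multivariate time series
  ts_data <- ts(df, frequency = frequency, start = start, end = end)

  # Overwrite each column in place so dimensions and column names are preserved
  for (i in seq_len(ncol(ts_data))) {
    df[, i] <- as.numeric(tsclean(ts_data[, i]))
  }
  return(df)
}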

Even cleaner, we can use lapply to hide the loop:

clean_ts <- function(df, frequency = 12, start = c(2014, 1), end = c(2015, 12)) {
  # Clean each column as its own ts; assigning into df[] keeps the column names
  df[] <- lapply(df, function(x) {
    as.numeric(tsclean(ts(x, frequency = frequency, start = start, end = end)))
  })
  return(df)
}
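One detail worth checking with any lapply-based version is whether the column names survive, since the example uses non-syntactic names such as "2to2". The sketch below compares two ways of turning lapply output back into a data.frame. It uses only base R, and `cap_at_5` is a hypothetical stand-in for a per-column cleaner, not tsclean() itself.

```r
# Toy data with non-syntactic column names, as in the question
df <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))
colnames(df) <- c("2to2", "17to9")

# Hypothetical stand-in for a per-column cleaner (NOT tsclean): caps values at 5
cap_at_5 <- function(x) pmin(x, 5)

# Pattern 1: rebuild a data frame from the list; default name checking applies
rebuilt <- as.data.frame(lapply(df, cap_at_5))

# Pattern 2: assign into df[]; the existing structure and names are kept
in_place <- df
in_place[] <- lapply(df, cap_at_5)

names(rebuilt)   # "X2to2" "X17to9"
names(in_place)  # "2to2" "17to9"
dim(in_place)    # 3 2
```

If the rebuilt form is preferred, passing `check.names = FALSE` to as.data.frame() should also keep the original names.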
Gregor Thomas