
Recently the R ecosystem has been enriched by a number of packages implementing a variety of outlier-detection algorithms, for both univariate and multivariate data, so detecting outliers within a single data set is relatively straightforward. My problem is different: I have two data sets (data frames): one is the reference, the other contains the values of interest. I want to evaluate whether each data point in the second data set (i.e. each row) is an outlier when compared with the reference data set.

In theory, I think my approach should be the following: take the first row of the second data set and append it (e.g. row-bind it) to the first. Compute outlier scores (e.g. with the DDoutlier package), sort them, and check whether the newly added row is among the highest scores. Then repeat iteratively for the second, third, fourth rows, and so on, up to the last row of the second data frame. This would allow me to identify which rows of the second data set are outliers when compared with the first.

My question is: how can I do this efficiently in R? I thought of using a for loop, but I am aware that for loops are not always the most efficient approach. In the absence of an alternative I would consider one anyway, but when I try to write it I get an error; I am clearly doing something wrong, but I do not understand how to correct it.

X <- iris[,1:4]
X

# Let's assume that the first 50 rows (X[1:50, ]) are the reference, and the rest (X[51:150, ]) are the test data.


library(DDoutlier)
outlier_score <- list()
for (i in seq_along(1:(nrow(X)-50))){
  newdf <- X[c(1:50, 50+i), ]
  outlier_score[[i]] <- COF(newdf, k = 5)
}

Trying to implement this for loop, I get the following error:

Error in distMatrix[SBNpath, SBNpath] : subscript out of bounds

  • I've just discovered that if I delete the rownames of `newdf` (with `rownames(newdf) <- c()`), the for loop written above works. My question remains, then: how to compute the outliers for the second data set more efficiently? – RAN Jul 11 '19 at 04:58
  • 1
    I don't know if what you are asking is possible, the efficiency depends on the package's code, not on your code. Also, your `seq_along` is a bit messy, use `seq.int(nrow(X) - 50)` instead, it's much more readable. – Rui Barradas Jul 11 '19 at 07:06
  • Thank you for your reply. I definitely can use the for loop (I tested it), but leaving aside the intrinsic limitations of the package's code, I am wondering whether there is a better alternative (in R the use of for loops is usually discouraged for the apply family or other optimized solutions). – RAN Jul 11 '19 at 18:24
  • 1
    1) [`apply` is not faster than a `for` loop](https://stackoverflow.com/questions/42393658/lapply-vs-for-loop-performance-r). 2) Your code is simple and readable. Why change that part? I would suggest that you change the way the list is created, `outlier_score <- vector("list", length = nrow(X) - 50)` will allocate space just once whereas to extend it inside a loop forces R to call the memory management routines over and over again. – Rui Barradas Jul 11 '19 at 18:57
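Pulling the comment suggestions together, a sketch of the loop with the three fixes mentioned above: resetting the row names (which avoids the "subscript out of bounds" error), pre-allocating the result list once, and using `seq.int` for the index. `COF()` and `k = 5` are taken from the question as given; the final `vapply` line, which extracts the score of the appended test row from each result, is an assumption about how one would then check whether that row scores highly.

```r
library(DDoutlier)

X <- iris[, 1:4]
n_ref <- 50
n_test <- nrow(X) - n_ref

# Allocate the list once instead of growing it inside the loop
outlier_score <- vector("list", length = n_test)

for (i in seq.int(n_test)) {
  newdf <- X[c(1:n_ref, n_ref + i), ]
  # Resetting row names avoids the subscript-out-of-bounds error in COF()
  rownames(newdf) <- NULL
  outlier_score[[i]] <- COF(newdf, k = 5)
}

# The appended test row is the last (51st) row of each newdf,
# so its score is the last element of each score vector
test_scores <- vapply(outlier_score, function(s) s[n_ref + 1], numeric(1))
```

As the comments note, this does not make the algorithm itself faster (that depends on the package's code); it only avoids the error and the repeated reallocation of the list.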

0 Answers