The R ecosystem has recently been enriched by a number of packages implementing a variety of outlier detection algorithms, for both univariate and multivariate data, so detecting outliers within a single data set is relatively straightforward. My problem is different: I have two data sets (data frames), one of which is the reference while the other contains the values of interest. I want to evaluate whether each data point (i.e. each row) of the second data set is an outlier when compared with the reference data set.
In theory, my approach would be the following: take the first row of the second data set and append it (e.g. row-bind it) to the first. Compute outlier scores (e.g. with the DDoutlier package), sort them and check whether the newly added row is among the highest scores. Then repeat the same for the second, third, fourth row and so on, up to the last row of the second data frame. This would identify which rows of the second data set are outliers with respect to the first.
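To make this concrete, the check for a single test row would look roughly like the sketch below. I am assuming here that COF() returns one outlier score per row of its input, in row order, so the last score belongs to the appended row; the names reference and candidate are just placeholders.

library(DDoutlier)

X <- iris[, 1:4]
reference <- X[1:50, ]    # reference data
candidate <- X[51, ]      # one row of the test data

newdf  <- rbind(reference, candidate)
scores <- COF(newdf, k = 5)          # assumed: one score per row of newdf

scores[nrow(newdf)]                  # score of the appended row
rank(-scores)[nrow(newdf)]           # rank 1 would mean the strongest outlier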
My question is: how can I do this efficiently in R? I thought of a for loop, although I am aware that for loops are not always the most efficient approach. In the absence of an alternative I would settle for one, but when I try to write it I get an error; I am clearly doing something wrong, but I cannot see how to correct it.
library(DDoutlier)

X <- iris[, 1:4]
# Assume the first 50 rows (X[1:50, ]) are the reference and
# the remaining 100 rows (X[51:150, ]) are the test data.

outlier_score <- list()
for (i in seq_len(nrow(X) - 50)) {
  # append the i-th test row to the reference and score all rows
  newdf <- X[c(1:50, 50 + i), ]
  outlier_score[[i]] <- COF(newdf, k = 5)
}
When I run this loop, I get the following error:
Error in distMatrix[SBNpath, SBNpath] : subscript out of bounds
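For completeness, this is the apply-style structure I would otherwise aim for instead of the explicit for loop. It is only a sketch: it calls COF() in exactly the same way as the loop above, and it keeps just the score of the appended row, again assuming that COF() returns the scores in row order.

library(DDoutlier)

X         <- iris[, 1:4]
reference <- X[1:50, ]
test      <- X[51:150, ]

# outlier score of each test row, computed against the reference data
test_scores <- sapply(seq_len(nrow(test)), function(i) {
  newdf  <- rbind(reference, test[i, ])
  scores <- COF(newdf, k = 5)
  scores[nrow(newdf)]              # score of the appended (last) row
})

Since both versions hinge on the same COF() call, I expect the same error either way.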